
I have a 3-box SolrCloud setup with ZooKeeper; each server has both a Solr and a ZK install (not perfect, I know). Everything was working fine until a network outage this morning.

Post-outage, boxes A and C came back as expected. Box B did not; a restart of the Solr service revealed an error which states:
A previous ephemeral live node still exists. Solr cannot continue.

Looking at the live_nodes path in ZooKeeper on node B, the Solr install already shows as an active live node even though Solr is off. This node is not shown on boxes A and C within the live_nodes path. I’m also unable to delete or rmr this node because ZooKeeper tells me that it doesn’t exist.

I attempted solr stop -all in case there was a hidden process that I wasn’t seeing, but Solr states that there are no instances running.

My next move was installing a fresh ZooKeeper instance on B. After that was up, ls /live_nodes continued to show this Solr instance that doesn’t exist.

Any help is appreciated. Thank you.

2 Answers


  1. Chosen as BEST ANSWER

    FYI, I continued troubleshooting and eventually rebuilt all 3 ZooKeeper nodes. That led to a separate error showing that the collection shard was broken. After troubleshooting the 'clusterstate.json' file, what ended up being the fix was creating a duplicate collection under a different name and then an alias to redirect traffic to it. After this I was able to delete the broken collection.

    I'm thinking a duplicate collection and alias would have fixed it the whole time.

    Hopefully this helps someone in the future. Thanks.
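
For anyone trying the alias route, here is a minimal sketch of the two Collections API calls it would involve, assuming the duplicate collection already exists. It only builds the URLs and sends nothing. The host, the collection names ("products", "products_v2") and the alias name ("products_live") are hypothetical placeholders, not values from the setup above.

```python
from urllib.parse import urlencode


def collections_api_url(base_url, action, **params):
    """Build a Solr Collections API URL. No request is sent."""
    query = urlencode({"action": action, **params})
    return f"{base_url.rstrip('/')}/admin/collections?{query}"


# Hypothetical names: "products" is the broken collection,
# "products_v2" is the duplicate, "products_live" is the alias clients query.
base = "http://localhost:8983/solr"

# Point the alias at the healthy duplicate collection.
alias_url = collections_api_url(
    base, "CREATEALIAS", name="products_live", collections="products_v2"
)

# Once traffic goes through the alias, remove the broken collection.
delete_url = collections_api_url(base, "DELETE", name="products")

print(alias_url)
print(delete_url)
```

Check that the alias resolves to the new collection before issuing the DELETE, since deleting a collection is not reversible.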


  2. We had a similar issue recently and were able to delete the stale entries from /solr/live_nodes using the steps below. After that, Solr was able to start up and get past the error from the OP.

    Adding this in the hope it will help someone else in the future.

    Example output of ls /solr/live_nodes in the ZK shell:

    [solr.node1.sp.local:8983_solr, solr.node2.sp.local:8983_solr]
    

    Create the solr nodes again (fails with Node already exists):

    create /solr/live_nodes/solr.node1.sp.local:8983_solr 
    create /solr/live_nodes/solr.node2.sp.local:8983_solr
    

    Set some data on the nodes:

    set /solr/live_nodes/solr.node1.sp.local:8983_solr "hello" 
    set /solr/live_nodes/solr.node2.sp.local:8983_solr "hello" 
    

    Delete the nodes:

    delete /solr/live_nodes/solr.node1.sp.local:8983_solr
    delete /solr/live_nodes/solr.node2.sp.local:8983_solr
    

    After that we were able to start Solr, the issue was resolved, and /solr/live_nodes was repopulated.
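
If there are many nodes, the create/set/delete sequence above can be generated rather than typed by hand. This is a small sketch only: the helper name stale_node_commands is made up for illustration, and it just prints ZK shell commands to paste in. It does not talk to ZooKeeper. The live_nodes_path parameter is there because the path depends on your chroot (/solr/live_nodes here, /live_nodes in the question).

```python
def stale_node_commands(node_names, live_nodes_path="/solr/live_nodes"):
    """Return the ZK shell commands from the steps above, in the same order."""
    paths = [f"{live_nodes_path}/{name}" for name in node_names]
    # Step 1: create (expected to fail with "Node already exists").
    commands = [f"create {path}" for path in paths]
    # Step 2: set some data on each node.
    commands += [f'set {path} "hello"' for path in paths]
    # Step 3: delete each node.
    commands += [f"delete {path}" for path in paths]
    return commands


# Node names taken from the example listing above.
nodes = ["solr.node1.sp.local:8983_solr", "solr.node2.sp.local:8983_solr"]
for command in stale_node_commands(nodes):
    print(command)
```

Review the printed commands before pasting them into the ZK shell, and only run them while Solr is stopped on the affected nodes.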
