I have a three-box SolrCloud setup with ZooKeeper; each server runs both Solr and ZK (not ideal, I know). Everything was working fine until a network outage this morning.
After the outage, boxes A and C came back as expected. Box B did not; restarting the Solr service produced this error:
A previous ephemeral live node still exists. Solr cannot continue.
Looking at the /live_nodes path in ZooKeeper on box B, the Solr install already shows as an active live node even though Solr is off. This node is not shown under /live_nodes on boxes A and C. I'm also unable to delete or rmr this node, because ZooKeeper tells me that it doesn't exist.
I tried solr stop -all in case there was a hidden process that I wasn't seeing, but Solr reports that no instances are running.
My next move was to install a fresh ZooKeeper instance on B. Once that was up, ls /live_nodes still showed this Solr instance, which isn't running.
Any help is appreciated. Thank you.
2 Answers
FYI, I continued troubleshooting and eventually rebuilt all 3 ZooKeeper nodes. That led to a separate error showing that the collection shard was broken. After troubleshooting the clusterstate.json file, what ended up fixing it was creating a duplicate collection under a different name and then an alias to redirect traffic. After this I was able to delete the broken collection.
I suspect a duplicate collection and an alias would have fixed it the whole time.
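For anyone wanting to try the same route, here is a rough sketch of the three Collections API requests involved (CREATE, CREATEALIAS, DELETE). It only builds the URLs and does not send anything. The base URL, collection names, alias name, config set name, and shard/replica counts are all made-up placeholders, not the values from my cluster, so adjust them before use.

```python
from urllib.parse import urlencode


def collections_api_url(base_url, params):
    """Build a Solr Collections API URL from a dict of query parameters."""
    return f"{base_url}/admin/collections?{urlencode(params)}"


def replacement_plan(base_url, broken, replacement, alias, config_name,
                     num_shards=1, replication_factor=3):
    """Return the request URLs in the order described above.

    All names passed in are hypothetical placeholders.
    """
    return [
        # 1. Create the duplicate collection under a different name.
        collections_api_url(base_url, {
            "action": "CREATE",
            "name": replacement,
            "numShards": num_shards,
            "replicationFactor": replication_factor,
            "collection.configName": config_name,
        }),
        # 2. Point an alias at the new collection to redirect traffic.
        collections_api_url(base_url, {
            "action": "CREATEALIAS",
            "name": alias,
            "collections": replacement,
        }),
        # 3. Delete the broken collection.
        collections_api_url(base_url, {
            "action": "DELETE",
            "name": broken,
        }),
    ]


if __name__ == "__main__":
    for url in replacement_plan("http://localhost:8983/solr", "products",
                                "products_v2", "products_alias",
                                "products_conf"):
        print(url)
```

Each printed URL can be requested with curl or a browser against any live Solr node. The sketch does not cover getting data into the new collection.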
Hopefully this helps someone in the future. Thanks.
We had a similar issue recently and were able to delete the data from /solr/live_nodes using the steps below, after which Solr started up and got past the error from the OP.
Adding this in the hope that it helps someone else in the future.
Example data in the ZK shell under /solr/live_nodes:
Create the solr nodes again (fails with Node already exists):
Set some data on the nodes:
Delete the nodes:
After that we were able to start Solr; the issue was resolved and /solr/live_nodes was repopulated.
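As a rough sketch of that sequence, the small helper below prints the zkCli.sh commands (ls, create, set, delete) for each stale entry. The node name and the placeholder data value are hypothetical, and the /solr chroot is assumed from the paths above; substitute whatever ls actually shows on your cluster.

```python
def live_node_cleanup_commands(node_names, live_nodes_path="/solr/live_nodes"):
    """Return zkCli.sh commands mirroring the steps above for each stale node.

    node_names are hypothetical; use the entries that ls actually lists.
    """
    commands = [f"ls {live_nodes_path}"]  # inspect the current entries
    for name in node_names:
        path = f"{live_nodes_path}/{name}"
        # Re-create the node; this is expected to fail with "Node already exists".
        commands.append(f"create {path} placeholder")
        # Set some data on the node.
        commands.append(f"set {path} placeholder")
        # Delete the node.
        commands.append(f"delete {path}")
    commands.append(f"ls {live_nodes_path}")  # confirm the entries are gone
    return commands


if __name__ == "__main__":
    for line in live_node_cleanup_commands(["10.0.0.2:8983_solr"]):
        print(line)
```

The printed lines are meant to be pasted into a zkCli.sh session one at a time, so that each result can be checked before moving on.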