We are running SolrCloud (Solr 8.3.1, NRT replicas) with a ZooKeeper ensemble, 3 nodes each, on CentOS VMs.
Each Solr node has 66 GB RAM, a 15 GB heap, and 4 CPUs.
Record count: 3.3 million. Average document size: 350 KB.
Everything works fine until something disturbs the cluster, such as load or network latency issues. The number of threads in TIMED_WAITING then rises to 7000+ and stays there until Solr is restarted.
Server 1:
7722 threads are in TIMED_WAITING
("lock":"java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@151d5f2f")
Server 2:
4046 threads are in TIMED_WAITING
("lock":"java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@1e0205c3")
Server 3:
4210 threads are in TIMED_WAITING
("lock":"java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@5ee792c0")
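For anyone who wants to reproduce per-state counts like these, here is a minimal sketch. It assumes a plain-text thread dump as produced by `jstack <pid>`, not the JSON output quoted in this post; the function name and the sample dump are made up for illustration.

```python
import re
from collections import Counter

# Matches the state line jstack prints under each thread header, e.g.
#   java.lang.Thread.State: TIMED_WAITING (parking)
STATE_RE = re.compile(r"java\.lang\.Thread\.State:\s+([A-Z_]+)")


def count_thread_states(dump_text):
    """Return a Counter mapping thread state -> number of threads."""
    return Counter(STATE_RE.findall(dump_text))


# Hypothetical sample shaped like jstack output (not taken from the cluster above).
sample_dump = '''
"qtp123-45" #45 prio=5 os_prio=0 tid=0x1 nid=0x2 waiting on condition
   java.lang.Thread.State: TIMED_WAITING (parking)

"qtp123-46" #46 prio=5 os_prio=0 tid=0x3 nid=0x4 waiting on condition
   java.lang.Thread.State: TIMED_WAITING (parking)

"main" #1 prio=5 os_prio=0 tid=0x5 nid=0x6 runnable
   java.lang.Thread.State: RUNNABLE
'''

if __name__ == "__main__":
    # Print states from most to least common.
    for state, n in count_thread_states(sample_dump).most_common():
        print(state, n)
```

To use it on a real node, save a dump with `jstack <pid> > dump.txt` and pass the file contents to `count_thread_states`. Comparing the counts before and after a disturbance should show whether the TIMED_WAITING threads really never drain.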
How can I increase the 3000 to something bigger? Will net.ipv4.tcp_tw_reuse=1 help? What is the drawback? Please help.
2 Answers
Validate system time/NTP sync during the error window; it might be one of the root causes. Also, watch for explicit commits from clients.
One possible workaround is to switch to HTTP/1.1 (Solr option -Dsolr.http1).