We are running SolrCloud (8.3.1, NRT) with a ZooKeeper ensemble, 3 nodes
each, on CentOS VMs.
Each Solr node has 66 GB RAM, a 15 GB heap, and 4 CPUs.
Record count: 3.3 million. Average document size is 350 KB.
Everything works fine until something disturbs the cluster, such as load or network latency issues. The number of threads in TIMED_WAITING then rises to 7000+ and stays there until Solr is restarted.
Server 1:
7722 threads are in TIMED_WAITING
Server 2:
4046 threads are in TIMED_WAITING
Server 3:
4210 threads are in TIMED_WAITING
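For anyone who wants to reproduce counts like these, here is a rough sketch that tallies thread states from a jstack-style thread dump and groups the TIMED_WAITING threads by name prefix, which may hint at which pool is piling up. The sample dump, the thread names in it, and the helper names (threads_by_state, count_states, waiting_pools) are all made up for illustration; real Solr thread names will differ.

```python
import re
from collections import Counter

# Thread header line in a jstack dump starts with the quoted thread name.
HEADER_RE = re.compile(r'^"([^"]+)"')
# State line printed under each header, e.g.
#    java.lang.Thread.State: TIMED_WAITING (parking)
STATE_RE = re.compile(r"java\.lang\.Thread\.State:\s+([A-Z_]+)")


def threads_by_state(dump_text):
    """Return a list of (thread_name, state) pairs found in the dump."""
    pairs = []
    name = None
    for line in dump_text.splitlines():
        header = HEADER_RE.match(line)
        if header:
            name = header.group(1)
            continue
        state = STATE_RE.search(line)
        if state and name is not None:
            pairs.append((name, state.group(1)))
            name = None  # wait for the next thread header
    return pairs


def count_states(dump_text):
    """Count how many threads are in each state."""
    return Counter(state for _, state in threads_by_state(dump_text))


def waiting_pools(dump_text):
    """Group TIMED_WAITING threads by name with the trailing id stripped."""
    return Counter(
        re.sub(r"[-\d]+$", "", name)
        for name, state in threads_by_state(dump_text)
        if state == "TIMED_WAITING"
    )


# Hypothetical, shortened dump used only to illustrate the parsing.
SAMPLE_DUMP = '''
"qtp-1" #21 prio=5 os_prio=0 tid=0x1 nid=0x1 waiting on condition
   java.lang.Thread.State: TIMED_WAITING (parking)
"qtp-2" #22 prio=5 os_prio=0 tid=0x2 nid=0x2 waiting on condition
   java.lang.Thread.State: TIMED_WAITING (sleeping)
"main" #1 prio=5 os_prio=0 tid=0x3 nid=0x3 runnable
   java.lang.Thread.State: RUNNABLE
'''

if __name__ == "__main__":
    print(count_states(SAMPLE_DUMP))
    print(waiting_pools(SAMPLE_DUMP))
```

To run it against a live node, save the output of jstack for the Solr process to a file and pass the file contents to these functions instead of SAMPLE_DUMP. Note that TIMED_WAITING here is a JVM thread state, which is a different thing from TCP sockets in TIME_WAIT.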
How can we increase the 3000 to something bigger? Will net.ipv4.tcp_tw_reuse=1 help? What is the drawback? Please help.
Validate system time/NTP sync during the error window; it might be one of the root causes. Also, watch for explicit commits from clients.
One possible workaround is to switch to HTTP/1 (solr option