I have 3 broker containers and 1 ZooKeeper container running in a Docker stack. Both the ZooKeeper container and the broker containers keep stopping after a few days (less than a week) of running in an idle state.
This is one of the broker logs where I see an error, but I cannot figure out how to handle it:
[2022-07-13 02:22:30,109] INFO [UnifiedLog partition=messages-processed-0, dir=/bitnami/kafka/data] Truncating to 6 has no effect as the largest offset in the log is 5 (kafka.log.UnifiedLog)
[2022-07-13 02:23:33,766] INFO [Controller id=1] Newly added brokers: , deleted brokers: 2, bounced brokers: , all live brokers: 1,3 (kafka.controller.KafkaController)
[2022-07-13 02:23:33,766] INFO [RequestSendThread controllerId=1] Shutting down (kafka.controller.RequestSendThread)
[2022-07-13 02:23:33,853] INFO [RequestSendThread controllerId=1] Stopped (kafka.controller.RequestSendThread)
[2022-07-13 02:23:33,853] INFO [RequestSendThread controllerId=1] Shutdown completed (kafka.controller.RequestSendThread)
[2022-07-13 02:23:34,226] INFO [Controller id=1] Broker failure callback for 2 (kafka.controller.KafkaController)
[2022-07-13 02:23:34,227] INFO [Controller id=1 epoch=3] Sending UpdateMetadata request to brokers Set() for 0 partitions (state.change.logger)
[2022-07-13 02:23:36,414] ERROR [Controller id=1 epoch=3] Controller 1 epoch 3 failed to change state for partition __consumer_offsets-30 from OfflinePartition to OnlinePartition (state.change.logger)
kafka.common.StateChangeFailedException: Failed to elect leader for partition __consumer_offsets-30 under strategy OfflinePartitionLeaderElectionStrategy(false)
at kafka.controller.ZkPartitionStateMachine.$anonfun$doElectLeaderForPartitions$7(PartitionStateMachine.scala:424)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at kafka.controller.ZkPartitionStateMachine.doElectLeaderForPartitions(PartitionStateMachine.scala:421)
at kafka.controller.ZkPartitionStateMachine.electLeaderForPartitions(PartitionStateMachine.scala:332)
at kafka.controller.ZkPartitionStateMachine.doHandleStateChanges(PartitionStateMachine.scala:238)
at kafka.controller.ZkPartitionStateMachine.handleStateChanges(PartitionStateMachine.scala:158)
at kafka.controller.PartitionStateMachine.triggerOnlineStateChangeForPartitions(PartitionStateMachine.scala:74)
at kafka.controller.PartitionStateMachine.triggerOnlinePartitionStateChange(PartitionStateMachine.scala:59)
at kafka.controller.KafkaController.onReplicasBecomeOffline(KafkaController.scala:627)
at kafka.controller.KafkaController.onBrokerFailure(KafkaController.scala:597)
at kafka.controller.KafkaController.processBrokerChange(KafkaController.scala:1621)
at kafka.controller.KafkaController.process(KafkaController.scala:2495)
at kafka.controller.QueuedEvent.process(ControllerEventManager.scala:52)
at kafka.controller.ControllerEventManager$ControllerEventThread.process$1(ControllerEventManager.scala:130)
at kafka.controller.ControllerEventManager$ControllerEventThread.$anonfun$doWork$1(ControllerEventManager.scala:133)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:31)
at kafka.controller.ControllerEventManager$ControllerEventThread.doWork(ControllerEventManager.scala:133)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96)
This is part of the ZooKeeper log from the same time:
2022-07-13 02:14:45,002 [myid:] - INFO [SessionTracker:o.a.z.s.ZooKeeperServer@632] - Expiring session 0x10000fba7760000, timeout of 18000ms exceeded
2022-07-13 02:15:13,001 [myid:] - INFO [SessionTracker:o.a.z.s.ZooKeeperServer@632] - Expiring session 0x10000fba7760007, timeout of 18000ms exceeded
2022-07-13 02:15:29,832 [myid:] - INFO [NIOWorkerThread-1:o.a.z.s.ZooKeeperServer@1087] - Invalid session 0x10000fba7760006 for client /172.18.0.5:55356, probably expired
2022-07-13 02:15:42,419 [myid:] - INFO [NIOWorkerThread-1:o.a.z.s.ZooKeeperServer@1087] - Invalid session 0x10000fba7760000 for client /172.18.0.4:59474, probably expired
2022-07-13 02:15:52,350 [myid:] - INFO [NIOWorkerThread-2:o.a.z.s.ZooKeeperServer@1087] - Invalid session 0x10000fba7760007 for client /172.18.0.6:34406, probably expired
2022-07-13 02:16:49,001 [myid:] - INFO [SessionTracker:o.a.z.s.ZooKeeperServer@632] - Expiring session 0x10000fba776000b, timeout of 18000ms exceeded
2022-07-13 02:17:12,434 [myid:] - INFO [NIOWorkerThread-2:o.a.z.s.ZooKeeperServer@1087] - Invalid session 0x10000fba776000b for client /172.18.0.5:56264, probably expired
2022-07-13 02:19:17,067 [myid:] - WARN [NIOWorkerThread-1:o.a.z.s.NIOServerCnxn@371] - Unexpected exception
org.apache.zookeeper.server.ServerCnxn$EndOfStreamException: Unable to read additional data from client, it probably closed the socket: address = /172.18.0.4:60150, session = 0x10000fba776000d
at org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:170)
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:333)
at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:508)
at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:153)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
2022-07-13 02:23:29,002 [myid:] - INFO [SessionTracker:o.a.z.s.ZooKeeperServer@632] - Expiring session 0x10000fba776000e, timeout of 18000ms exceeded
2022-07-13 02:24:05,059 [myid:] - INFO [NIOWorkerThread-1:o.a.z.s.ZooKeeperServer@1087] - Invalid session 0x10000fba776000e for client /172.18.0.5:32886, probably expired
2022-07-13 03:48:55,209 [myid:] - WARN [NIOWorkerThread-2:o.a.z.s.NIOServerCnxn@371] - Unexpected exception
org.apache.zookeeper.server.ServerCnxn$EndOfStreamException: Unable to read additional data from client, it probably closed the socket: address = /172.18.0.4:33508, session = 0x10000fba776000d
at org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:170)
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:333)
at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:508)
at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:153)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
I've seen that a maxSessionTimeout of 40000ms is set in my zoo.cfg (ZooKeeper side) and a timeout of 18000ms is set in server.properties on the broker side. Should I increase one of those?
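For reference, these are the settings in question (I'm presuming the broker-side 18000ms is `zookeeper.session.timeout.ms`, which would match the 18000ms session expirations in the ZooKeeper log above):

```
# zoo.cfg (ZooKeeper side)
maxSessionTimeout=40000

# server.properties (broker side)
zookeeper.session.timeout.ms=18000
```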
**kafka-topics --describe** output for one of the topics that went down: https://prnt.sc/r3hU5wv3jK-h
2 Answers
You can try the following.

In your uploaded image, it appears `brokerId=1` has crashed on its own, which is probably not related to the logs you've shown. The logs that you did show ("Failed to elect leader") are happening on the other brokers because `ReplicationFactor: 1` is set on `__consumer_offsets`, which is why you see `Leader: none, Replicas: 1`. In a healthy Kafka cluster, no partition leader should be none.

This can be fixed on an existing Kafka cluster using `kafka-reassign-partitions`.
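For example, a reassignment that spreads `__consumer_offsets` across all three brokers would look roughly like the sketch below. The JSON goes in a file (e.g. `reassign.json`) and must list all 50 `__consumer_offsets` partitions; only the first three are shown, and the broker ids and bootstrap address are assumptions based on your logs:

```
{"version":1,"partitions":[
  {"topic":"__consumer_offsets","partition":0,"replicas":[1,2,3]},
  {"topic":"__consumer_offsets","partition":1,"replicas":[2,3,1]},
  {"topic":"__consumer_offsets","partition":2,"replicas":[3,1,2]}
]}
```

```
kafka-reassign-partitions.sh \
  --bootstrap-server localhost:9092 \
  --reassignment-json-file reassign.json \
  --execute
```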
But since you are using Docker, the best way to fix this would be to first clear all the data, stop the containers, then add your missing `KAFKA_CFG_OFFSETS_TOPIC_REPLICATION_FACTOR=3` environment variable and restart them all.
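A minimal sketch of that change in a compose/stack file, assuming the bitnami/kafka image and hypothetical service names (the same variable goes on all three broker services):

```yaml
services:
  kafka-1:
    image: bitnami/kafka:3.2                         # assumed image/tag
    environment:
      - KAFKA_CFG_BROKER_ID=1
      - KAFKA_CFG_ZOOKEEPER_CONNECT=zookeeper:2181   # assumed zookeeper service name
      # the missing setting; maps to offsets.topic.replication.factor in server.properties
      - KAFKA_CFG_OFFSETS_TOPIC_REPLICATION_FACTOR=3
  # kafka-2 and kafka-3 need the same variable
```

Once the data directories are wiped and the brokers come back up, `__consumer_offsets` should be recreated with the new replication factor the first time a consumer group connects.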
Or, as mentioned in the comments, don't run 3 brokers on a single host. And if you are using a single host, timeouts aren't really relevant because all network requests are local.