
I have 3 broker containers and 1 ZooKeeper container running in a Docker stack. Both the ZooKeeper container and the broker containers keep stopping after a few days (less than a week) of running in an idle state.

This is one of the broker logs where I see an error, but I cannot figure out how to handle it:

[2022-07-13 02:22:30,109] INFO [UnifiedLog partition=messages-processed-0, dir=/bitnami/kafka/data] Truncating to 6 has no effect as the largest offset in the log is 5 (kafka.log.UnifiedLog)
[2022-07-13 02:23:33,766] INFO [Controller id=1] Newly added brokers: , deleted brokers: 2, bounced brokers: , all live brokers: 1,3 (kafka.controller.KafkaController)
[2022-07-13 02:23:33,766] INFO [RequestSendThread controllerId=1] Shutting down (kafka.controller.RequestSendThread)
[2022-07-13 02:23:33,853] INFO [RequestSendThread controllerId=1] Stopped (kafka.controller.RequestSendThread)
[2022-07-13 02:23:33,853] INFO [RequestSendThread controllerId=1] Shutdown completed (kafka.controller.RequestSendThread)
[2022-07-13 02:23:34,226] INFO [Controller id=1] Broker failure callback for 2 (kafka.controller.KafkaController)
[2022-07-13 02:23:34,227] INFO [Controller id=1 epoch=3] Sending UpdateMetadata request to brokers Set() for 0 partitions (state.change.logger)
[2022-07-13 02:23:36,414] ERROR [Controller id=1 epoch=3] Controller 1 epoch 3 failed to change state for partition __consumer_offsets-30 from OfflinePartition to OnlinePartition (state.change.logger)
kafka.common.StateChangeFailedException: Failed to elect leader for partition __consumer_offsets-30 under strategy OfflinePartitionLeaderElectionStrategy(false)
    at kafka.controller.ZkPartitionStateMachine.$anonfun$doElectLeaderForPartitions$7(PartitionStateMachine.scala:424)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at kafka.controller.ZkPartitionStateMachine.doElectLeaderForPartitions(PartitionStateMachine.scala:421)
    at kafka.controller.ZkPartitionStateMachine.electLeaderForPartitions(PartitionStateMachine.scala:332)
    at kafka.controller.ZkPartitionStateMachine.doHandleStateChanges(PartitionStateMachine.scala:238)
    at kafka.controller.ZkPartitionStateMachine.handleStateChanges(PartitionStateMachine.scala:158)
    at kafka.controller.PartitionStateMachine.triggerOnlineStateChangeForPartitions(PartitionStateMachine.scala:74)
    at kafka.controller.PartitionStateMachine.triggerOnlinePartitionStateChange(PartitionStateMachine.scala:59)
    at kafka.controller.KafkaController.onReplicasBecomeOffline(KafkaController.scala:627)
    at kafka.controller.KafkaController.onBrokerFailure(KafkaController.scala:597)
    at kafka.controller.KafkaController.processBrokerChange(KafkaController.scala:1621)
    at kafka.controller.KafkaController.process(KafkaController.scala:2495)
    at kafka.controller.QueuedEvent.process(ControllerEventManager.scala:52)
    at kafka.controller.ControllerEventManager$ControllerEventThread.process$1(ControllerEventManager.scala:130)
    at kafka.controller.ControllerEventManager$ControllerEventThread.$anonfun$doWork$1(ControllerEventManager.scala:133)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:31)
    at kafka.controller.ControllerEventManager$ControllerEventThread.doWork(ControllerEventManager.scala:133)
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96)

This is part of the ZooKeeper log from around the same time:

2022-07-13 02:14:45,002 [myid:] - INFO  [SessionTracker:o.a.z.s.ZooKeeperServer@632] - Expiring session 0x10000fba7760000, timeout of 18000ms exceeded
2022-07-13 02:15:13,001 [myid:] - INFO  [SessionTracker:o.a.z.s.ZooKeeperServer@632] - Expiring session 0x10000fba7760007, timeout of 18000ms exceeded
2022-07-13 02:15:29,832 [myid:] - INFO  [NIOWorkerThread-1:o.a.z.s.ZooKeeperServer@1087] - Invalid session 0x10000fba7760006 for client /172.18.0.5:55356, probably expired
2022-07-13 02:15:42,419 [myid:] - INFO  [NIOWorkerThread-1:o.a.z.s.ZooKeeperServer@1087] - Invalid session 0x10000fba7760000 for client /172.18.0.4:59474, probably expired
2022-07-13 02:15:52,350 [myid:] - INFO  [NIOWorkerThread-2:o.a.z.s.ZooKeeperServer@1087] - Invalid session 0x10000fba7760007 for client /172.18.0.6:34406, probably expired
2022-07-13 02:16:49,001 [myid:] - INFO  [SessionTracker:o.a.z.s.ZooKeeperServer@632] - Expiring session 0x10000fba776000b, timeout of 18000ms exceeded
2022-07-13 02:17:12,434 [myid:] - INFO  [NIOWorkerThread-2:o.a.z.s.ZooKeeperServer@1087] - Invalid session 0x10000fba776000b for client /172.18.0.5:56264, probably expired
2022-07-13 02:19:17,067 [myid:] - WARN  [NIOWorkerThread-1:o.a.z.s.NIOServerCnxn@371] - Unexpected exception
org.apache.zookeeper.server.ServerCnxn$EndOfStreamException: Unable to read additional data from client, it probably closed the socket: address = /172.18.0.4:60150, session = 0x10000fba776000d
    at org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:170)
    at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:333)
    at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:508)
    at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:153)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
2022-07-13 02:23:29,002 [myid:] - INFO  [SessionTracker:o.a.z.s.ZooKeeperServer@632] - Expiring session 0x10000fba776000e, timeout of 18000ms exceeded
2022-07-13 02:24:05,059 [myid:] - INFO  [NIOWorkerThread-1:o.a.z.s.ZooKeeperServer@1087] - Invalid session 0x10000fba776000e for client /172.18.0.5:32886, probably expired
2022-07-13 03:48:55,209 [myid:] - WARN  [NIOWorkerThread-2:o.a.z.s.NIOServerCnxn@371] - Unexpected exception
org.apache.zookeeper.server.ServerCnxn$EndOfStreamException: Unable to read additional data from client, it probably closed the socket: address = /172.18.0.4:33508, session = 0x10000fba776000d
    at org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:170)
    at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:333)
    at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:508)
    at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:153)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)


I've seen that a maxSessionTimeout of 40000 ms is set in my zoo.cfg (ZooKeeper side) and a timeout of 18000 ms is set in server.properties on the broker side. Should I increase one of those?
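
For reference, these are the two settings I mean; the property names below are the standard ZooKeeper and Kafka ones, as far as I can tell, and the values are what I currently have:

    # zoo.cfg (ZooKeeper side): upper bound on client session timeouts
    maxSessionTimeout=40000

    # server.properties (broker side): session timeout the broker requests from ZooKeeper
    zookeeper.session.timeout.ms=18000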

**kafka-topics --describe** output for one of the topics that went down: https://prnt.sc/r3hU5wv3jK-h
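
The describe output in the screenshot came from a command along these lines (the broker address and <topic-name> are placeholders for my actual values):

    # replace <topic-name> with the topic shown in the screenshot
    kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic <topic-name>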

Image used for the broker container

2 Answers


  1. You can try the following:

    # remove the existing Kafka and ZooKeeper data volumes
    # (these names are from the Sentry on-premise stack; substitute your own stack's volume names)
    docker volume rm sentry-kafka
    docker volume rm sentry-zookeeper
    docker volume rm sentry_onpremise_sentry-kafka-log
    docker volume rm sentry_onpremise_sentry-zookeeper-log

    ./install.sh  # to re-create the Kafka partitions
    
  2. In your uploaded image, it appears that brokerId=1 has crashed on its own, which is probably unrelated to the logs you've shown.

    The logs you did show ("Failed to elect leader") are happening on the other brokers because __consumer_offsets has ReplicationFactor: 1. That is why you see Leader: none, Replicas: 1. In a healthy Kafka cluster, no partition leader should be none.

    This can be fixed on an existing Kafka cluster using kafka-reassign-partitions, but since you are using Docker, the best way to fix it is to stop the containers, clear all the data, add the missing KAFKA_CFG_OFFSETS_TOPIC_REPLICATION_FACTOR=3 environment variable, and then restart them all (see the compose sketch at the end of this answer).

    Or, as mentioned in the comments, don't run 3 brokers on a single host. And if you are on a single host, the timeouts aren't really relevant, because all network requests are local.

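    A minimal compose sketch of the Docker fix above, assuming the Bitnami Kafka image (the service name, broker id, and ZooKeeper address below are placeholders; the setting that matters is the offsets-topic replication factor):

    # docker-compose.yml (sketch): one of the three broker services
    kafka-1:
      image: bitnami/kafka:latest
      environment:
        - KAFKA_CFG_ZOOKEEPER_CONNECT=zookeeper:2181
        - KAFKA_CFG_BROKER_ID=1
        # with 3 brokers, replicate __consumer_offsets to all of them so losing
        # one broker no longer leaves its partitions without a leader
        - KAFKA_CFG_OFFSETS_TOPIC_REPLICATION_FACTOR=3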