Restart loop after unclean shutdown

Describe the bug

A hardware failure brought one of our brokers down, and Strimzi kept restarting it, which made it impossible for the broker to recover the unflushed log segments.

To Reproduce

You could probably reproduce this issue by sending a SIGKILL to a broker while running it on a very slow disk. The core issue is that if the broker takes long enough to recover, Strimzi will force-restart it.

Expected behavior

Strimzi should not force-roll a broker while unflushed log segments are being recovered.

Environment (please complete the following information):

  • Strimzi version: 0.23
  • Installation method: YAML
  • Kubernetes cluster: v1.19.8
  • Infrastructure: Amazon EKS

YAML files and logs

2021-07-06 21:53:51 INFO  AbstractOperator:255 - Reconciliation #5815(timer) Kafka(new-kafka/main): Kafka main will be checked for creation or modification
2021-07-06 21:54:23 INFO  KafkaRoller:300 - Reconciliation #5815(timer) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 250ms
2021-07-06 21:54:51 INFO  AbstractOperator:399 - Reconciliation #5815(timer) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 21:54:54 INFO  KafkaRoller:300 - Reconciliation #5815(timer) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 500ms
2021-07-06 21:55:24 INFO  KafkaRoller:300 - Reconciliation #5815(timer) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 1000ms
2021-07-06 21:55:51 INFO  ClusterOperator:125 - Triggering periodic reconciliation for namespace new-kafka...
2021-07-06 21:55:51 INFO  AbstractOperator:399 - Reconciliation #5815(timer) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 21:55:55 INFO  KafkaRoller:300 - Reconciliation #5815(timer) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 2000ms
2021-07-06 21:56:27 INFO  KafkaRoller:300 - Reconciliation #5815(timer) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 4000ms
2021-07-06 21:56:51 INFO  AbstractOperator:399 - Reconciliation #5815(timer) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 21:57:01 INFO  KafkaRoller:300 - Reconciliation #5815(timer) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 8000ms
2021-07-06 21:57:39 INFO  KafkaRoller:300 - Reconciliation #5815(timer) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 16000ms
2021-07-06 21:57:51 INFO  ClusterOperator:125 - Triggering periodic reconciliation for namespace new-kafka...
2021-07-06 21:57:51 INFO  AbstractOperator:399 - Reconciliation #5815(timer) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 21:58:25 INFO  KafkaRoller:300 - Reconciliation #5815(timer) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 32000ms
2021-07-06 21:58:51 INFO  AbstractOperator:399 - Reconciliation #5815(timer) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 21:59:27 INFO  KafkaRoller:300 - Reconciliation #5815(timer) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 64000ms
2021-07-06 21:59:51 INFO  ClusterOperator:125 - Triggering periodic reconciliation for namespace new-kafka...
2021-07-06 21:59:51 INFO  AbstractOperator:399 - Reconciliation #5815(timer) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:00:51 INFO  AbstractOperator:399 - Reconciliation #5815(timer) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:01:01 INFO  KafkaRoller:508 - Reconciliation #5815(timer) Kafka(new-kafka/main): Pod 3 needs to be restarted. Reason: []
2021-07-06 22:01:42 WARN  KafkaRoller:386 - Reconciliation #5815(timer) Kafka(new-kafka/main): Pod main-kafka-3 will be force-rolled, due to error: Call(callName=listNodes, deadlineMs=1625608901805, tries=1, nextAllowedTryMs=1625608901906) timed out at 1625608901806 after 1 attempt(s)
2021-07-06 22:01:42 INFO  PodOperator:65 - Rolling update of new-kafka/main-kafka: Rolling pod main-kafka-3
2021-07-06 22:01:51 INFO  ClusterOperator:125 - Triggering periodic reconciliation for namespace new-kafka...
2021-07-06 22:01:51 INFO  AbstractOperator:399 - Reconciliation #5815(timer) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:02:51 INFO  AbstractOperator:399 - Reconciliation #5815(timer) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:03:51 INFO  ClusterOperator:125 - Triggering periodic reconciliation for namespace new-kafka...
2021-07-06 22:03:51 INFO  AbstractOperator:399 - Reconciliation #5815(timer) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:04:51 INFO  AbstractOperator:399 - Reconciliation #5815(timer) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:05:51 INFO  ClusterOperator:125 - Triggering periodic reconciliation for namespace new-kafka...
2021-07-06 22:05:51 INFO  AbstractOperator:399 - Reconciliation #5815(timer) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:06:42 INFO  KafkaRoller:293 - Reconciliation #5815(timer) Kafka(new-kafka/main): Could not roll pod 3, giving up after 10 attempts. Total delay between attempts 127750ms
io.strimzi.operator.cluster.operator.resource.KafkaRoller$UnforceableProblem: Error while trying to restart pod main-kafka-3 to become ready
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$restartAndAwaitReadiness$15(KafkaRoller.java:640) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:680) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartAndAwaitReadiness(KafkaRoller.java:640) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartIfNecessary(KafkaRoller.java:387) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$schedule$6(KafkaRoller.java:278) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: java.util.concurrent.TimeoutException
	at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886) ~[?:?]
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021) ~[?:?]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:676) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	... 9 more
2021-07-06 22:06:42 ERROR AbstractOperator:276 - Reconciliation #5815(timer) Kafka(new-kafka/main): createOrUpdate failed
io.strimzi.operator.cluster.operator.resource.KafkaRoller$UnforceableProblem: Error while trying to restart pod main-kafka-3 to become ready
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$restartAndAwaitReadiness$15(KafkaRoller.java:640) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:680) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartAndAwaitReadiness(KafkaRoller.java:640) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartIfNecessary(KafkaRoller.java:387) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$schedule$6(KafkaRoller.java:278) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
	at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: java.util.concurrent.TimeoutException
	at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886) ~[?:?]
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021) ~[?:?]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:676) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	... 9 more
2021-07-06 22:06:42 INFO  OperatorWatcher:40 - Reconciliation #5822(watch) Kafka(new-kafka/main): Kafka main in namespace new-kafka was MODIFIED
2021-07-06 22:06:42 INFO  CrdOperator:108 - Status of Kafka main in namespace new-kafka has been updated
2021-07-06 22:06:42 WARN  AbstractOperator:516 - Reconciliation #5815(timer) Kafka(new-kafka/main): Failed to reconcile
io.strimzi.operator.cluster.operator.resource.KafkaRoller$UnforceableProblem: Error while trying to restart pod main-kafka-3 to become ready
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$restartAndAwaitReadiness$15(KafkaRoller.java:640) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:680) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartAndAwaitReadiness(KafkaRoller.java:640) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartIfNecessary(KafkaRoller.java:387) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$schedule$6(KafkaRoller.java:278) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
	at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: java.util.concurrent.TimeoutException
	at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886) ~[?:?]
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021) ~[?:?]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:676) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	... 9 more
2021-07-06 22:06:42 INFO  AbstractOperator:255 - Reconciliation #5822(watch) Kafka(new-kafka/main): Kafka main will be checked for creation or modification
2021-07-06 22:07:14 INFO  KafkaRoller:300 - Reconciliation #5822(watch) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 250ms
2021-07-06 22:07:42 INFO  AbstractOperator:399 - Reconciliation #5822(watch) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:07:45 INFO  KafkaRoller:300 - Reconciliation #5822(watch) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 500ms
2021-07-06 22:07:51 INFO  ClusterOperator:125 - Triggering periodic reconciliation for namespace new-kafka...
2021-07-06 22:08:15 INFO  KafkaRoller:300 - Reconciliation #5822(watch) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 1000ms
2021-07-06 22:08:42 INFO  AbstractOperator:399 - Reconciliation #5822(watch) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:08:46 INFO  KafkaRoller:300 - Reconciliation #5822(watch) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 2000ms
2021-07-06 22:09:18 INFO  KafkaRoller:300 - Reconciliation #5822(watch) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 4000ms
2021-07-06 22:09:42 INFO  AbstractOperator:399 - Reconciliation #5822(watch) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:09:51 INFO  ClusterOperator:125 - Triggering periodic reconciliation for namespace new-kafka...
2021-07-06 22:09:52 INFO  KafkaRoller:300 - Reconciliation #5822(watch) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 8000ms
2021-07-06 22:10:30 INFO  KafkaRoller:300 - Reconciliation #5822(watch) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 16000ms
2021-07-06 22:10:42 INFO  AbstractOperator:399 - Reconciliation #5822(watch) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:11:16 INFO  KafkaRoller:300 - Reconciliation #5822(watch) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 32000ms
2021-07-06 22:11:42 INFO  AbstractOperator:399 - Reconciliation #5822(watch) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:11:51 INFO  ClusterOperator:125 - Triggering periodic reconciliation for namespace new-kafka...
2021-07-06 22:12:18 INFO  KafkaRoller:300 - Reconciliation #5822(watch) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 64000ms
2021-07-06 22:12:42 INFO  AbstractOperator:399 - Reconciliation #5822(watch) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:13:42 INFO  AbstractOperator:399 - Reconciliation #5822(watch) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:13:51 INFO  ClusterOperator:125 - Triggering periodic reconciliation for namespace new-kafka...
2021-07-06 22:13:52 INFO  KafkaRoller:508 - Reconciliation #5822(watch) Kafka(new-kafka/main): Pod 3 needs to be restarted. Reason: []
2021-07-06 22:14:33 WARN  KafkaRoller:386 - Reconciliation #5822(watch) Kafka(new-kafka/main): Pod main-kafka-3 will be force-rolled, due to error: Call(callName=listNodes, deadlineMs=1625609672873, tries=1, nextAllowedTryMs=1625609672974) timed out at 1625609672874 after 1 attempt(s)
2021-07-06 22:14:33 INFO  PodOperator:65 - Rolling update of new-kafka/main-kafka: Rolling pod main-kafka-3
2021-07-06 22:14:42 INFO  AbstractOperator:399 - Reconciliation #5822(watch) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:15:42 INFO  AbstractOperator:399 - Reconciliation #5822(watch) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:15:51 INFO  ClusterOperator:125 - Triggering periodic reconciliation for namespace new-kafka...
2021-07-06 22:16:42 INFO  AbstractOperator:399 - Reconciliation #5822(watch) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:17:42 INFO  AbstractOperator:399 - Reconciliation #5822(watch) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:17:51 INFO  ClusterOperator:125 - Triggering periodic reconciliation for namespace new-kafka...
2021-07-06 22:18:42 INFO  AbstractOperator:399 - Reconciliation #5822(watch) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:19:42 INFO  AbstractOperator:399 - Reconciliation #5822(watch) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:19:51 INFO  ClusterOperator:125 - Triggering periodic reconciliation for namespace new-kafka...
2021-07-06 22:20:42 INFO  AbstractOperator:399 - Reconciliation #5822(watch) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:21:42 INFO  AbstractOperator:399 - Reconciliation #5822(watch) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:21:51 INFO  ClusterOperator:125 - Triggering periodic reconciliation for namespace new-kafka...
2021-07-06 22:22:42 INFO  AbstractOperator:399 - Reconciliation #5822(watch) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:23:42 INFO  AbstractOperator:399 - Reconciliation #5822(watch) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:23:46 INFO  KafkaRoller:284 - Reconciliation #5822(watch) Kafka(new-kafka/main): Could not restart pod 3, giving up after 10 attempts. Total delay between attempts 127750ms
io.strimzi.operator.cluster.operator.resource.KafkaRoller$FatalProblem: Error while waiting for restarted pod main-kafka-3 to become ready
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$awaitReadiness$16(KafkaRoller.java:647) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:680) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.awaitReadiness(KafkaRoller.java:647) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartAndAwaitReadiness(KafkaRoller.java:641) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartIfNecessary(KafkaRoller.java:387) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$schedule$6(KafkaRoller.java:278) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: java.util.concurrent.TimeoutException
	at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886) ~[?:?]
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021) ~[?:?]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:676) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	... 10 more
2021-07-06 22:23:46 ERROR AbstractOperator:276 - Reconciliation #5822(watch) Kafka(new-kafka/main): createOrUpdate failed
io.strimzi.operator.cluster.operator.resource.KafkaRoller$FatalProblem: Error while waiting for restarted pod main-kafka-3 to become ready
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$awaitReadiness$16(KafkaRoller.java:647) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:680) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.awaitReadiness(KafkaRoller.java:647) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartAndAwaitReadiness(KafkaRoller.java:641) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartIfNecessary(KafkaRoller.java:387) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$schedule$6(KafkaRoller.java:278) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
	at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: java.util.concurrent.TimeoutException
	at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886) ~[?:?]
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021) ~[?:?]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:676) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	... 10 more
2021-07-06 22:23:46 ERROR Util:136 - Exceeded timeout of 300000ms while waiting for Pods resource main-kafka-3 in namespace new-kafka to be ready
2021-07-06 22:23:46 WARN  KafkaRoller:807 - Reconciliation #5822(watch) Kafka(new-kafka/main): Error waiting for pod new-kafka/main-kafka-3 to become ready: io.strimzi.operator.common.operator.resource.TimeoutException: Exceeded timeout of 300000ms while waiting for Pods resource main-kafka-3 in namespace new-kafka to be ready
2021-07-06 22:23:46 INFO  CrdOperator:108 - Status of Kafka main in namespace new-kafka has been updated
2021-07-06 22:23:46 INFO  OperatorWatcher:40 - Reconciliation #5831(watch) Kafka(new-kafka/main): Kafka main in namespace new-kafka was MODIFIED
2021-07-06 22:23:46 WARN  AbstractOperator:516 - Reconciliation #5822(watch) Kafka(new-kafka/main): Failed to reconcile
io.strimzi.operator.cluster.operator.resource.KafkaRoller$FatalProblem: Error while waiting for restarted pod main-kafka-3 to become ready
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$awaitReadiness$16(KafkaRoller.java:647) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:680) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.awaitReadiness(KafkaRoller.java:647) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartAndAwaitReadiness(KafkaRoller.java:641) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartIfNecessary(KafkaRoller.java:387) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$schedule$6(KafkaRoller.java:278) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
	at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: java.util.concurrent.TimeoutException
	at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886) ~[?:?]
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021) ~[?:?]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:676) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	... 10 more
2021-07-06 22:23:46 INFO  AbstractOperator:255 - Reconciliation #5831(watch) Kafka(new-kafka/main): Kafka main will be checked for creation or modification
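
The retry delays in the log above double from 250 ms up to 64000 ms, and the reported "Total delay between attempts 127750ms" is the sum of the nine delays that separate ten attempts. A minimal sketch of that arithmetic, assuming a plain doubling schedule; the function name is hypothetical and is not part of Strimzi:

```python
def backoff_delays(first_delay_ms: int, attempts: int) -> list[int]:
    """Return the delays between consecutive attempts, doubling each time."""
    # N attempts are separated by N - 1 delays.
    return [first_delay_ms * 2 ** i for i in range(attempts - 1)]


# Values taken from the log: first delay 250 ms, giving up after 10 attempts.
delays = backoff_delays(250, 10)
total_delay_ms = sum(delays)

print(delays)
print(total_delay_ms)
```

The summed delays come to roughly 128 seconds. The timestamps in the log suggest each attempt also spends about 30 seconds failing before the next delay starts, so the wall-clock time before the force-roll is considerably longer than the backoff total alone.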

Additional context

It seems that the main cause is that the code at https://github.com/strimzi/strimzi-kafka-operator/blob/4ab518535d1ef4feefac38845c594fa3520adc9c/cluster-operator/src/main/java/io/strimzi/operator/cluster/operator/resource/KafkaRoller.java#L521 fails while the broker is recovering the log segments.

2021-07-06 22:07:14 INFO  KafkaRoller:300 - Reconciliation #5822(watch) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 250ms

This may also be caused by some corner case in AdminClient and is not necessarily linked to the recovery phase of the cluster.

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 20 (18 by maintainers)

Top GitHub Comments

1 reaction
maowerner commented, Aug 1, 2022

Are there any manual workarounds for this, or extended timeouts we can apply? We managed to work around the problem by setting strimzi-kafka-operator.operationTimeoutMs to a value that is larger than the restart time after unclean shutdown. 20 minutes worked quite well for us.
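
The workaround above amounts to choosing an operation timeout that is longer than the broker's recovery time. A minimal sketch of that comparison: the 300000 ms value is the timeout reported in the operator log, the 20-minute value comes from the comment, and the 15-minute recovery time and the helper names are hypothetical placeholders to be replaced with a measured figure.

```python
from datetime import timedelta


def to_millis(duration: timedelta) -> int:
    """Convert a duration to whole milliseconds."""
    return int(duration.total_seconds() * 1000)


def survives_recovery(timeout_ms: int, recovery_ms: int) -> bool:
    """True if the timeout leaves the broker enough time to finish recovering."""
    return timeout_ms > recovery_ms


# "Exceeded timeout of 300000ms" in the operator log above.
logged_timeout_ms = 300_000
# Value reported to work in the comment: 20 minutes.
workaround_timeout_ms = to_millis(timedelta(minutes=20))
# Hypothetical: how long the broker needs after an unclean shutdown.
assumed_recovery_ms = to_millis(timedelta(minutes=15))

print(workaround_timeout_ms)
print(survives_recovery(logged_timeout_ms, assumed_recovery_ms))
print(survives_recovery(workaround_timeout_ms, assumed_recovery_ms))
```

With these assumed numbers, the logged 5-minute timeout is too short and the 20-minute timeout is long enough; the right value depends on the recovery time actually observed on the affected broker.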

0 reactions
scholzj commented, Jul 21, 2022

Triaged on 21.7.2022: The plan is to improve the KafkaRoller and the agent in Kafka to provide more details about the current Kafka state. That will allow us to handle this issue in a better way. This should be kept open; a proposal is needed.
