Restart loop after unclean shutdown
Describe the bug
A hardware failure brought one of our brokers down, and Strimzi kept restarting it, which made it impossible for the broker to recover its unflushed log segments.
To Reproduce
You could probably reproduce this issue by sending a SIGKILL to a broker while running it on a very slow disk. The core issue is that if the broker takes long enough to recover, Strimzi will force-restart it.
Expected behavior
Strimzi should not force-roll a broker while unflushed log segments are being recovered.
Environment (please complete the following information):
- Strimzi version: 0.23
- Installation method: YAML
- Kubernetes cluster: v1.19.8
- Infrastructure: Amazon EKS
YAML files and logs
2021-07-06 21:53:51 INFO AbstractOperator:255 - Reconciliation #5815(timer) Kafka(new-kafka/main): Kafka main will be checked for creation or modification
2021-07-06 21:54:23 INFO KafkaRoller:300 - Reconciliation #5815(timer) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 250ms
2021-07-06 21:54:51 INFO AbstractOperator:399 - Reconciliation #5815(timer) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 21:54:54 INFO KafkaRoller:300 - Reconciliation #5815(timer) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 500ms
2021-07-06 21:55:24 INFO KafkaRoller:300 - Reconciliation #5815(timer) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 1000ms
2021-07-06 21:55:51 INFO ClusterOperator:125 - Triggering periodic reconciliation for namespace new-kafka...
2021-07-06 21:55:51 INFO AbstractOperator:399 - Reconciliation #5815(timer) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 21:55:55 INFO KafkaRoller:300 - Reconciliation #5815(timer) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 2000ms
2021-07-06 21:56:27 INFO KafkaRoller:300 - Reconciliation #5815(timer) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 4000ms
2021-07-06 21:56:51 INFO AbstractOperator:399 - Reconciliation #5815(timer) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 21:57:01 INFO KafkaRoller:300 - Reconciliation #5815(timer) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 8000ms
2021-07-06 21:57:39 INFO KafkaRoller:300 - Reconciliation #5815(timer) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 16000ms
2021-07-06 21:57:51 INFO ClusterOperator:125 - Triggering periodic reconciliation for namespace new-kafka...
2021-07-06 21:57:51 INFO AbstractOperator:399 - Reconciliation #5815(timer) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 21:58:25 INFO KafkaRoller:300 - Reconciliation #5815(timer) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 32000ms
2021-07-06 21:58:51 INFO AbstractOperator:399 - Reconciliation #5815(timer) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 21:59:27 INFO KafkaRoller:300 - Reconciliation #5815(timer) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 64000ms
2021-07-06 21:59:51 INFO ClusterOperator:125 - Triggering periodic reconciliation for namespace new-kafka...
2021-07-06 21:59:51 INFO AbstractOperator:399 - Reconciliation #5815(timer) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:00:51 INFO AbstractOperator:399 - Reconciliation #5815(timer) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:01:01 INFO KafkaRoller:508 - Reconciliation #5815(timer) Kafka(new-kafka/main): Pod 3 needs to be restarted. Reason: []
2021-07-06 22:01:42 WARN KafkaRoller:386 - Reconciliation #5815(timer) Kafka(new-kafka/main): Pod main-kafka-3 will be force-rolled, due to error: Call(callName=listNodes, deadlineMs=1625608901805, tries=1, nextAllowedTryMs=1625608901906) timed out at 1625608901806 after 1 attempt(s)
2021-07-06 22:01:42 INFO PodOperator:65 - Rolling update of new-kafka/main-kafka: Rolling pod main-kafka-3
2021-07-06 22:01:51 INFO ClusterOperator:125 - Triggering periodic reconciliation for namespace new-kafka...
2021-07-06 22:01:51 INFO AbstractOperator:399 - Reconciliation #5815(timer) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:02:51 INFO AbstractOperator:399 - Reconciliation #5815(timer) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:03:51 INFO ClusterOperator:125 - Triggering periodic reconciliation for namespace new-kafka...
2021-07-06 22:03:51 INFO AbstractOperator:399 - Reconciliation #5815(timer) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:04:51 INFO AbstractOperator:399 - Reconciliation #5815(timer) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:05:51 INFO ClusterOperator:125 - Triggering periodic reconciliation for namespace new-kafka...
2021-07-06 22:05:51 INFO AbstractOperator:399 - Reconciliation #5815(timer) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:06:42 INFO KafkaRoller:293 - Reconciliation #5815(timer) Kafka(new-kafka/main): Could not roll pod 3, giving up after 10 attempts. Total delay between attempts 127750ms
io.strimzi.operator.cluster.operator.resource.KafkaRoller$UnforceableProblem: Error while trying to restart pod main-kafka-3 to become ready
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$restartAndAwaitReadiness$15(KafkaRoller.java:640) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:680) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartAndAwaitReadiness(KafkaRoller.java:640) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartIfNecessary(KafkaRoller.java:387) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$schedule$6(KafkaRoller.java:278) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: java.util.concurrent.TimeoutException
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886) ~[?:?]
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021) ~[?:?]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:676) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
... 9 more
2021-07-06 22:06:42 ERROR AbstractOperator:276 - Reconciliation #5815(timer) Kafka(new-kafka/main): createOrUpdate failed
io.strimzi.operator.cluster.operator.resource.KafkaRoller$UnforceableProblem: Error while trying to restart pod main-kafka-3 to become ready
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$restartAndAwaitReadiness$15(KafkaRoller.java:640) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:680) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartAndAwaitReadiness(KafkaRoller.java:640) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartIfNecessary(KafkaRoller.java:387) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$schedule$6(KafkaRoller.java:278) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: java.util.concurrent.TimeoutException
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886) ~[?:?]
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021) ~[?:?]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:676) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
... 9 more
2021-07-06 22:06:42 INFO OperatorWatcher:40 - Reconciliation #5822(watch) Kafka(new-kafka/main): Kafka main in namespace new-kafka was MODIFIED
2021-07-06 22:06:42 INFO CrdOperator:108 - Status of Kafka main in namespace new-kafka has been updated
2021-07-06 22:06:42 WARN AbstractOperator:516 - Reconciliation #5815(timer) Kafka(new-kafka/main): Failed to reconcile
io.strimzi.operator.cluster.operator.resource.KafkaRoller$UnforceableProblem: Error while trying to restart pod main-kafka-3 to become ready
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$restartAndAwaitReadiness$15(KafkaRoller.java:640) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:680) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartAndAwaitReadiness(KafkaRoller.java:640) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartIfNecessary(KafkaRoller.java:387) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$schedule$6(KafkaRoller.java:278) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: java.util.concurrent.TimeoutException
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886) ~[?:?]
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021) ~[?:?]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:676) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
... 9 more
2021-07-06 22:06:42 INFO AbstractOperator:255 - Reconciliation #5822(watch) Kafka(new-kafka/main): Kafka main will be checked for creation or modification
2021-07-06 22:07:14 INFO KafkaRoller:300 - Reconciliation #5822(watch) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 250ms
2021-07-06 22:07:42 INFO AbstractOperator:399 - Reconciliation #5822(watch) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:07:45 INFO KafkaRoller:300 - Reconciliation #5822(watch) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 500ms
2021-07-06 22:07:51 INFO ClusterOperator:125 - Triggering periodic reconciliation for namespace new-kafka...
2021-07-06 22:08:15 INFO KafkaRoller:300 - Reconciliation #5822(watch) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 1000ms
2021-07-06 22:08:42 INFO AbstractOperator:399 - Reconciliation #5822(watch) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:08:46 INFO KafkaRoller:300 - Reconciliation #5822(watch) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 2000ms
2021-07-06 22:09:18 INFO KafkaRoller:300 - Reconciliation #5822(watch) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 4000ms
2021-07-06 22:09:42 INFO AbstractOperator:399 - Reconciliation #5822(watch) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:09:51 INFO ClusterOperator:125 - Triggering periodic reconciliation for namespace new-kafka...
2021-07-06 22:09:52 INFO KafkaRoller:300 - Reconciliation #5822(watch) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 8000ms
2021-07-06 22:10:30 INFO KafkaRoller:300 - Reconciliation #5822(watch) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 16000ms
2021-07-06 22:10:42 INFO AbstractOperator:399 - Reconciliation #5822(watch) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:11:16 INFO KafkaRoller:300 - Reconciliation #5822(watch) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 32000ms
2021-07-06 22:11:42 INFO AbstractOperator:399 - Reconciliation #5822(watch) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:11:51 INFO ClusterOperator:125 - Triggering periodic reconciliation for namespace new-kafka...
2021-07-06 22:12:18 INFO KafkaRoller:300 - Reconciliation #5822(watch) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 64000ms
2021-07-06 22:12:42 INFO AbstractOperator:399 - Reconciliation #5822(watch) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:13:42 INFO AbstractOperator:399 - Reconciliation #5822(watch) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:13:51 INFO ClusterOperator:125 - Triggering periodic reconciliation for namespace new-kafka...
2021-07-06 22:13:52 INFO KafkaRoller:508 - Reconciliation #5822(watch) Kafka(new-kafka/main): Pod 3 needs to be restarted. Reason: []
2021-07-06 22:14:33 WARN KafkaRoller:386 - Reconciliation #5822(watch) Kafka(new-kafka/main): Pod main-kafka-3 will be force-rolled, due to error: Call(callName=listNodes, deadlineMs=1625609672873, tries=1, nextAllowedTryMs=1625609672974) timed out at 1625609672874 after 1 attempt(s)
2021-07-06 22:14:33 INFO PodOperator:65 - Rolling update of new-kafka/main-kafka: Rolling pod main-kafka-3
2021-07-06 22:14:42 INFO AbstractOperator:399 - Reconciliation #5822(watch) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:15:42 INFO AbstractOperator:399 - Reconciliation #5822(watch) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:15:51 INFO ClusterOperator:125 - Triggering periodic reconciliation for namespace new-kafka...
2021-07-06 22:16:42 INFO AbstractOperator:399 - Reconciliation #5822(watch) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:17:42 INFO AbstractOperator:399 - Reconciliation #5822(watch) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:17:51 INFO ClusterOperator:125 - Triggering periodic reconciliation for namespace new-kafka...
2021-07-06 22:18:42 INFO AbstractOperator:399 - Reconciliation #5822(watch) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:19:42 INFO AbstractOperator:399 - Reconciliation #5822(watch) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:19:51 INFO ClusterOperator:125 - Triggering periodic reconciliation for namespace new-kafka...
2021-07-06 22:20:42 INFO AbstractOperator:399 - Reconciliation #5822(watch) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:21:42 INFO AbstractOperator:399 - Reconciliation #5822(watch) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:21:51 INFO ClusterOperator:125 - Triggering periodic reconciliation for namespace new-kafka...
2021-07-06 22:22:42 INFO AbstractOperator:399 - Reconciliation #5822(watch) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:23:42 INFO AbstractOperator:399 - Reconciliation #5822(watch) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:23:46 INFO KafkaRoller:284 - Reconciliation #5822(watch) Kafka(new-kafka/main): Could not restart pod 3, giving up after 10 attempts. Total delay between attempts 127750ms
io.strimzi.operator.cluster.operator.resource.KafkaRoller$FatalProblem: Error while waiting for restarted pod main-kafka-3 to become ready
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$awaitReadiness$16(KafkaRoller.java:647) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:680) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.awaitReadiness(KafkaRoller.java:647) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartAndAwaitReadiness(KafkaRoller.java:641) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartIfNecessary(KafkaRoller.java:387) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$schedule$6(KafkaRoller.java:278) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: java.util.concurrent.TimeoutException
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886) ~[?:?]
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021) ~[?:?]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:676) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
... 10 more
2021-07-06 22:23:46 ERROR AbstractOperator:276 - Reconciliation #5822(watch) Kafka(new-kafka/main): createOrUpdate failed
io.strimzi.operator.cluster.operator.resource.KafkaRoller$FatalProblem: Error while waiting for restarted pod main-kafka-3 to become ready
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$awaitReadiness$16(KafkaRoller.java:647) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:680) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.awaitReadiness(KafkaRoller.java:647) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartAndAwaitReadiness(KafkaRoller.java:641) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartIfNecessary(KafkaRoller.java:387) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$schedule$6(KafkaRoller.java:278) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: java.util.concurrent.TimeoutException
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886) ~[?:?]
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021) ~[?:?]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:676) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
... 10 more
2021-07-06 22:23:46 ERROR Util:136 - Exceeded timeout of 300000ms while waiting for Pods resource main-kafka-3 in namespace new-kafka to be ready
2021-07-06 22:23:46 WARN KafkaRoller:807 - Reconciliation #5822(watch) Kafka(new-kafka/main): Error waiting for pod new-kafka/main-kafka-3 to become ready: io.strimzi.operator.common.operator.resource.TimeoutException: Exceeded timeout of 300000ms while waiting for Pods resource main-kafka-3 in namespace new-kafka to be ready
2021-07-06 22:23:46 INFO CrdOperator:108 - Status of Kafka main in namespace new-kafka has been updated
2021-07-06 22:23:46 INFO OperatorWatcher:40 - Reconciliation #5831(watch) Kafka(new-kafka/main): Kafka main in namespace new-kafka was MODIFIED
2021-07-06 22:23:46 WARN AbstractOperator:516 - Reconciliation #5822(watch) Kafka(new-kafka/main): Failed to reconcile
io.strimzi.operator.cluster.operator.resource.KafkaRoller$FatalProblem: Error while waiting for restarted pod main-kafka-3 to become ready
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$awaitReadiness$16(KafkaRoller.java:647) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:680) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.awaitReadiness(KafkaRoller.java:647) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartAndAwaitReadiness(KafkaRoller.java:641) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartIfNecessary(KafkaRoller.java:387) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$schedule$6(KafkaRoller.java:278) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: java.util.concurrent.TimeoutException
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886) ~[?:?]
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021) ~[?:?]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:676) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
... 10 more
2021-07-06 22:23:46 INFO AbstractOperator:255 - Reconciliation #5831(watch) Kafka(new-kafka/main): Kafka main will be checked for creation or modification
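To make the restart cadence in the log above easier to see, here is a minimal sketch that pulls out the force-roll warnings and measures the time between them. The regular expression and the `force_roll_times` helper are made up for this illustration and only assume the log layout shown above; the two sample lines are copied from the log with the error detail cut off after "due to error".

```python
import re
from datetime import datetime

# Two lines copied from the log above, truncated after "due to error".
LOG = """\
2021-07-06 22:01:42 WARN KafkaRoller:386 - Reconciliation #5815(timer) Kafka(new-kafka/main): Pod main-kafka-3 will be force-rolled, due to error
2021-07-06 22:14:33 WARN KafkaRoller:386 - Reconciliation #5822(watch) Kafka(new-kafka/main): Pod main-kafka-3 will be force-rolled, due to error
"""

# Hypothetical pattern: timestamp, WARN level, KafkaRoller source, pod name.
FORCE_ROLL = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) WARN KafkaRoller:\d+ - .*"
    r"Pod (\S+) will be force-rolled"
)


def force_roll_times(log_text):
    """Return (timestamp, pod) pairs for every force-roll warning."""
    events = []
    for line in log_text.splitlines():
        match = FORCE_ROLL.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            events.append((stamp, match.group(2)))
    return events


events = force_roll_times(LOG)
# Seconds between consecutive force-rolls.
gaps = [(b[0] - a[0]).total_seconds() for a, b in zip(events, events[1:])]
print(gaps)  # [771.0]
```

In this excerpt the two force-rolls of main-kafka-3 are 771 seconds (just under 13 minutes) apart.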
Additional context
The main cause seems to be that the code at https://github.com/strimzi/strimzi-kafka-operator/blob/4ab518535d1ef4feefac38845c594fa3520adc9c/cluster-operator/src/main/java/io/strimzi/operator/cluster/operator/resource/KafkaRoller.java#L521 fails while the broker is recovering its log segments:
2021-07-06 22:07:14 INFO KafkaRoller:300 - Reconciliation #5822(watch) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 250ms
This may also be caused by a corner case in AdminClient and is not necessarily linked to the recovery phase of the cluster.
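The "retrying after at least ...ms" values in the log look like a doubling backoff that starts at 250ms, and the reported "Total delay between attempts 127750ms" matches that reading. The sketch below only reproduces the arithmetic; `backoff_delays` is a hypothetical helper, not the KafkaRoller implementation.

```python
def backoff_delays(initial_ms=250, attempts=10):
    """Delays between consecutive attempts, doubling each time."""
    # 10 attempts are separated by 9 waits.
    return [initial_ms * 2 ** i for i in range(attempts - 1)]


delays = backoff_delays()
print(delays)       # [250, 500, 1000, 2000, 4000, 8000, 16000, 32000, 64000]
print(sum(delays))  # 127750
```

The backoff waits add up to only about 128 seconds, while the log shows 7 minutes 51 seconds between the start of reconciliation #5815 (21:53:51) and its force-roll (22:01:42), so most of that time is spent in the attempts themselves rather than in the waits.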
Issue Analytics
- Created 2 years ago
- Comments:20 (18 by maintainers)
Top GitHub Comments
Triaged on 21.7.2022: The plan is to improve the KafkaRoller and the agent in Kafka to provide more details about the current Kafka state. That will allow us to handle this issue in a better way. This should be kept open; a proposal is needed.
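As a rough illustration of how more detailed broker state could feed into the roll decision, here is a minimal sketch. Everything in it is an assumption made for this example: the `BrokerState` names, the `should_force_roll` function and its parameters are hypothetical and do not come from Strimzi or Kafka.

```python
from enum import Enum


class BrokerState(Enum):
    # Hypothetical state names for illustration only.
    STARTING = "starting"
    RECOVERING_LOGS = "recovering_logs"
    RUNNING = "running"
    UNKNOWN = "unknown"


def should_force_roll(state, admin_call_failed):
    """Decide whether an unreachable broker may be force-restarted."""
    if not admin_call_failed:
        return False  # the broker answered, so there is nothing to force
    # A broker that is still starting or replaying unflushed segments is
    # expected to be unreachable; restarting it would interrupt the recovery.
    if state in (BrokerState.STARTING, BrokerState.RECOVERING_LOGS):
        return False
    return True


print(should_force_roll(BrokerState.RECOVERING_LOGS, admin_call_failed=True))  # False
print(should_force_roll(BrokerState.UNKNOWN, admin_call_failed=True))          # True
```

Treating an unknown state as force-rollable keeps today's behavior whenever no state information is available; whether that is the right default would be up to the proposal.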