Pod could not be rotated due to under-replicated partition
Describe the bug: Unable to roll a pod during a Strimzi upgrade because of an unexpected under-replicated partition error.
To Reproduce
- Upgrade From 0.25 to 0.26.1
- Random error difficult to reproduce
Expected behavior: Hello, when upgrading to version 0.26.1 we had several cases where the pod rollout was blocked by an under-replication error that does not seem expected given the configuration in place. The problem occurred with random topics, but also with the __consumer_offsets topic.
Environment (please complete the following information):
- Strimzi version: 0.26.1
- Installation method: Helm chart
- Kubernetes cluster: Kubernetes 1.20
- Infrastructure: Amazon EKS
YAML files and logs
Cluster config (topic-related settings):
config:
  auto.create.topics.enable: 'false'
  num.partitions: 12
  default.replication.factor: 3
  min.insync.replicas: 1
  offsets.topic.replication.factor: 3
  transaction.state.log.replication.factor: 3
  transaction.state.log.min.isr: 1
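For context: with min.insync.replicas: 1, the roller's availability check should only refuse to restart a broker when that broker is the last in-sync replica of some partition, which is exactly what the ISR={0} messages in the logs below report. A quick way to confirm the effective broker-level settings, as a sketch only (it assumes a plain listener on port 9092 and the standard Strimzi pod/container layout; namespace and pod names are taken from the logs below):

```bash
# Inspect the effective configuration of broker 0 from inside its pod
# (the 9092 listener and the /opt/kafka path are assumptions about this setup).
kubectl exec -n kafka-metapro kafka-metapro-kafka-0 -c kafka -- \
  /opt/kafka/bin/kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type brokers --entity-name 0 --describe --all \
  | grep -E 'min.insync.replicas|replication.factor'
```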
Topic config for one of the topics where we had this issue:
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: app-to-websocket
  namespace: kafka-applications
  labels:
    strimzi.io/cluster: kafka
spec:
  partitions: 12
  replicas: 3
  config:
    retention.ms: "3600000" # 1 h
    segment.ms: "300000" # 5 min
and for __consumer_offsets (the default internal topic):
partitions: 50
replicas: 3
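It can also help to list which partitions are actually under-replicated at the moment the roller blocks, rather than relying only on the operator log. A sketch, with the same assumptions as above about the listener port and pod layout:

```bash
# List every partition whose ISR is currently smaller than its replica set
kubectl exec -n kafka-metapro kafka-metapro-kafka-0 -c kafka -- \
  /opt/kafka/bin/kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --under-replicated-partitions

# Look specifically at __consumer_offsets partition 27, the one named in the logs
kubectl exec -n kafka-metapro kafka-metapro-kafka-0 -c kafka -- \
  /opt/kafka/bin/kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --topic __consumer_offsets | grep 'Partition: 27'
```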
2021-12-16 16:05:21 INFO KafkaRoller:299 - Reconciliation #70(timer) Kafka(kafka-metapro/kafka-metapro): Could not roll pod 0 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Pod kafka-metapro-kafka-0 is currently the controller and there are other pods still to roll, retrying after at least 250ms
2021-12-16 16:05:22 INFO AbstractOperator:466 - Reconciliation #69(timer) Kafka(kafka-customers-logs/kafka): reconciled
2021-12-16 16:05:22 INFO AbstractOperator:466 - Reconciliation #68(timer) Kafka(kafka-applications/kafka): reconciled
2021-12-16 16:05:22 INFO KafkaAvailability:135 - Reconciliation #70(timer) Kafka(kafka-metapro/kafka-metapro): __consumer_offsets/27 will be under-replicated (ISR={0}, replicas=[0,4,5], min.insync.replicas=1) if broker 0 is restarted.
2021-12-16 16:05:22 INFO KafkaRoller:299 - Reconciliation #70(timer) Kafka(kafka-metapro/kafka-metapro): Could not roll pod 0 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$UnforceableProblem: Pod kafka-metapro-kafka-0 is currently not rollable, retrying after at least 500ms
2021-12-16 16:05:22 INFO AbstractOperator:466 - Reconciliation #71(timer) Kafka(kafka-systems-logs/kafka): reconciled
2021-12-16 16:05:22 INFO KafkaAvailability:135 - Reconciliation #70(timer) Kafka(kafka-metapro/kafka-metapro): __consumer_offsets/27 will be under-replicated (ISR={0}, replicas=[0,4,5], min.insync.replicas=1) if broker 0 is restarted.
2021-12-16 16:05:22 INFO KafkaRoller:299 - Reconciliation #70(timer) Kafka(kafka-metapro/kafka-metapro): Could not roll pod 0 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$UnforceableProblem: Pod kafka-metapro-kafka-0 is currently not rollable, retrying after at least 1000ms
2021-12-16 16:05:24 INFO KafkaAvailability:135 - Reconciliation #70(timer) Kafka(kafka-metapro/kafka-metapro): __consumer_offsets/27 will be under-replicated (ISR={0}, replicas=[0,4,5], min.insync.replicas=1) if broker 0 is restarted.
2021-12-16 16:05:24 INFO KafkaRoller:299 - Reconciliation #70(timer) Kafka(kafka-metapro/kafka-metapro): Could not roll pod 0 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$UnforceableProblem: Pod kafka-metapro-kafka-0 is currently not rollable, retrying after at least 2000ms
2021-12-16 16:05:26 INFO KafkaAvailability:135 - Reconciliation #70(timer) Kafka(kafka-metapro/kafka-metapro): __consumer_offsets/27 will be under-replicated (ISR={0}, replicas=[0,4,5], min.insync.replicas=1) if broker 0 is restarted.
2021-12-16 16:05:26 INFO KafkaRoller:299 - Reconciliation #70(timer) Kafka(kafka-metapro/kafka-metapro): Could not roll pod 0 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$UnforceableProblem: Pod kafka-metapro-kafka-0 is currently not rollable, retrying after at least 4000ms
2021-12-16 16:05:30 INFO KafkaAvailability:135 - Reconciliation #70(timer) Kafka(kafka-metapro/kafka-metapro): __consumer_offsets/27 will be under-replicated (ISR={0}, replicas=[0,4,5], min.insync.replicas=1) if broker 0 is restarted.
2021-12-16 16:05:30 INFO KafkaRoller:299 - Reconciliation #70(timer) Kafka(kafka-metapro/kafka-metapro): Could not roll pod 0 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$UnforceableProblem: Pod kafka-metapro-kafka-0 is currently not rollable, retrying after at least 8000ms
2021-12-16 16:05:38 INFO KafkaAvailability:135 - Reconciliation #70(timer) Kafka(kafka-metapro/kafka-metapro): __consumer_offsets/27 will be under-replicated (ISR={0}, replicas=[0,4,5], min.insync.replicas=1) if broker 0 is restarted.
2021-12-16 16:05:38 INFO KafkaRoller:299 - Reconciliation #70(timer) Kafka(kafka-metapro/kafka-metapro): Could not roll pod 0 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$UnforceableProblem: Pod kafka-metapro-kafka-0 is currently not rollable, retrying after at least 16000ms
2021-12-16 16:05:54 INFO KafkaAvailability:135 - Reconciliation #70(timer) Kafka(kafka-metapro/kafka-metapro): __consumer_offsets/27 will be under-replicated (ISR={0}, replicas=[0,4,5], min.insync.replicas=1) if broker 0 is restarted.
2021-12-16 16:05:54 INFO KafkaRoller:299 - Reconciliation #70(timer) Kafka(kafka-metapro/kafka-metapro): Could not roll pod 0 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$UnforceableProblem: Pod kafka-metapro-kafka-0 is currently not rollable, retrying after at least 32000ms
2021-12-16 16:06:16 INFO AbstractOperator:363 - Reconciliation #70(timer) Kafka(kafka-metapro/kafka-metapro): Reconciliation is in progress
2021-12-16 16:06:27 INFO KafkaAvailability:135 - Reconciliation #70(timer) Kafka(kafka-metapro/kafka-metapro): __consumer_offsets/27 will be under-replicated (ISR={0}, replicas=[0,4,5], min.insync.replicas=1) if broker 0 is restarted.
2021-12-16 16:06:27 INFO KafkaRoller:299 - Reconciliation #70(timer) Kafka(kafka-metapro/kafka-metapro): Could not roll pod 0 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$UnforceableProblem: Pod kafka-metapro-kafka-0 is currently not rollable, retrying after at least 64000ms
2021-12-16 16:07:31 INFO KafkaRoller:292 - Reconciliation #70(timer) Kafka(kafka-metapro/kafka-metapro): Could not roll pod 0, giving up after 10 attempts. Total delay between attempts 127750ms
io.strimzi.operator.cluster.operator.resource.KafkaRoller$UnforceableProblem: Pod kafka-metapro-kafka-0 is currently not rollable
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartIfNecessary(KafkaRoller.java:370) ~[io.strimzi.cluster-operator-0.26.1.jar:0.26.1]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$schedule$6(KafkaRoller.java:277) ~[io.strimzi.cluster-operator-0.26.1.jar:0.26.1]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
2021-12-16 16:07:31 ERROR AbstractOperator:240 - Reconciliation #70(timer) Kafka(kafka-metapro/kafka-metapro): createOrUpdate failed
io.strimzi.operator.cluster.operator.resource.KafkaRoller$UnforceableProblem: Pod kafka-metapro-kafka-0 is currently not rollable
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartIfNecessary(KafkaRoller.java:370) ~[io.strimzi.cluster-operator-0.26.1.jar:0.26.1]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$schedule$6(KafkaRoller.java:277) ~[io.strimzi.cluster-operator-0.26.1.jar:0.26.1]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
Additional context: migration from 0.25 to 0.26.1.
Top GitHub Comments
Hello, I'm posting quick feedback so the issue can be closed. We identified that the internal topic
__strimzi-topic-operator-kstreams-topic-store-changelog
had replicas set to 1; I think this setting came from a previous version (0.21 -> 0.22?). We set replicas to 3 and ran a partition reassignment. I don't know whether it was the root cause, but we have not identified any other sync issue in the migrations we ran after this change.
Thanks
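For anyone hitting the same situation: increasing the replication factor of an existing topic is done through a partition reassignment. A minimal sketch of what that could look like for the changelog topic (the broker IDs and the single partition 0 below are illustrative placeholders, not what was actually run; a real reassignment file needs one entry per partition, and the pod/namespace names are taken from the logs above):

```bash
# Build a reassignment file that gives the topic 3 replicas per partition
cat > reassignment.json <<'EOF'
{
  "version": 1,
  "partitions": [
    { "topic": "__strimzi-topic-operator-kstreams-topic-store-changelog",
      "partition": 0,
      "replicas": [0, 1, 2] }
  ]
}
EOF

# Copy it into a broker pod and execute the reassignment
kubectl exec -n kafka-metapro -i kafka-metapro-kafka-0 -c kafka -- \
  bash -c 'cat > /tmp/reassignment.json' < reassignment.json
kubectl exec -n kafka-metapro kafka-metapro-kafka-0 -c kafka -- \
  /opt/kafka/bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --reassignment-json-file /tmp/reassignment.json --execute

# Check that the reassignment finished
kubectl exec -n kafka-metapro kafka-metapro-kafka-0 -c kafka -- \
  /opt/kafka/bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --reassignment-json-file /tmp/reassignment.json --verify
```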
Well, you are right that I would not necessarily expect the replicas to be out of sync. All I meant was that the operator algorithm seems to work as intended here.
But I do not really know the cluster, so it is hard for me to speculate about the reasons. It could have been out of sync already before. Or it could be related to a recent restart - but in my experience that usually syncs up fairly quickly for the consumer offsets topic. It could also just be slow networking, and so on. So it is hard to say what is causing it.
Kafka logs might say more. I'm not really an expert on Kafka itself, so I do not have any pointers to what exactly to look for. But maybe if you share the logs, others might have some idea.
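If it happens again, the broker logs are probably the first place to look: the partition leader logs whenever it shrinks or expands the ISR, which should show when and why brokers 4 and 5 dropped out of the ISR for __consumer_offsets-27. A sketch of how one might pull that out of the Strimzi pods (pod and namespace names are taken from the logs above; the message wording assumes the default broker logging configuration):

```bash
# Look for ISR shrink/expand events on the brokers that hold __consumer_offsets-27
for pod in kafka-metapro-kafka-0 kafka-metapro-kafka-4 kafka-metapro-kafka-5; do
  echo "== $pod =="
  kubectl logs -n kafka-metapro "$pod" -c kafka \
    | grep -Ei 'shrinking isr|expanding isr' | tail -n 20
done
```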