[Bug] Kafka brokers stuck in a Rolling update loop
Describe the bug
From the operator logs we see:
2020-11-24 09:24:16 INFO PodOperator:65 - Rolling update of development/kafka-default-development-kafka: Rolling pod kafka-default-development-kafka-0
2020-11-24 09:24:41 INFO PodOperator:65 - Rolling update of development/kafka-default-development-kafka: Rolling pod kafka-default-development-kafka-1
2020-11-24 09:25:11 INFO AbstractOperator:455 - Reconciliation #54(timer) Kafka(development/kafka-default-development): reconciled
2020-11-24 09:26:13 INFO ClusterOperator:125 - Triggering periodic reconciliation for namespace experimental...
2020-11-24 09:26:13 INFO ClusterOperator:125 - Triggering periodic reconciliation for namespace production...
2020-11-24 09:26:14 INFO ClusterOperator:125 - Triggering periodic reconciliation for namespace kube-system...
2020-11-24 09:26:14 INFO ClusterOperator:125 - Triggering periodic reconciliation for namespace staging...
2020-11-24 09:26:14 INFO ClusterOperator:125 - Triggering periodic reconciliation for namespace development...
2020-11-24 09:26:14 INFO AbstractOperator:217 - Reconciliation #55(timer) Kafka(development/kafka-default-development): Kafka kafka-default-development will be checked for creation or modification
2020-11-24 09:26:16 INFO PodOperator:65 - Rolling update of development/kafka-default-development-kafka: Rolling pod kafka-default-development-kafka-0
2020-11-24 09:26:44 INFO PodOperator:65 - Rolling update of development/kafka-default-development-kafka: Rolling pod kafka-default-development-kafka-1
2020-11-24 09:27:15 INFO AbstractOperator:455 - Reconciliation #55(timer) Kafka(development/kafka-default-development): reconciled
2020-11-24 09:28:13 INFO ClusterOperator:125 - Triggering periodic reconciliation for namespace experimental...
2020-11-24 09:28:13 INFO ClusterOperator:125 - Triggering periodic reconciliation for namespace production...
2020-11-24 09:28:14 INFO ClusterOperator:125 - Triggering periodic reconciliation for namespace kube-system...
2020-11-24 09:28:14 INFO ClusterOperator:125 - Triggering periodic reconciliation for namespace staging...
2020-11-24 09:28:14 INFO ClusterOperator:125 - Triggering periodic reconciliation for namespace development...
2020-11-24 09:28:14 INFO AbstractOperator:217 - Reconciliation #56(timer) Kafka(development/kafka-default-development): Kafka kafka-default-development will be checked for creation or modification
2020-11-24 09:28:16 INFO PodOperator:65 - Rolling update of development/kafka-default-development-kafka: Rolling pod kafka-default-development-kafka-0
2020-11-24 09:28:46 INFO PodOperator:65 - Rolling update of development/kafka-default-development-kafka: Rolling pod kafka-default-development-kafka-1
2020-11-24 09:29:18 INFO AbstractOperator:455 - Reconciliation #56(timer) Kafka(development/kafka-default-development): reconciled
2020-11-24 09:30:13 INFO ClusterOperator:125 - Triggering periodic reconciliation for namespace experimental...
2020-11-24 09:30:13 INFO ClusterOperator:125 - Triggering periodic reconciliation for namespace production...
2020-11-24 09:30:14 INFO ClusterOperator:125 - Triggering periodic reconciliation for namespace kube-system...
2020-11-24 09:30:14 INFO ClusterOperator:125 - Triggering periodic reconciliation for namespace staging...
2020-11-24 09:30:14 INFO ClusterOperator:125 - Triggering periodic reconciliation for namespace development...
2020-11-24 09:30:14 INFO AbstractOperator:217 - Reconciliation #57(timer) Kafka(development/kafka-default-development): Kafka kafka-default-development will be checked for creation or modification
2020-11-24 09:30:16 INFO PodOperator:65 - Rolling update of development/kafka-default-development-kafka: Rolling pod kafka-default-development-kafka-0
Kafka brokers get restarted every ~2 minutes.
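The "every ~2 minutes" figure can be checked directly from the operator log excerpt above. The following is a minimal sketch, not part of the original report: it copies the "Rolling pod" lines from the log and measures the gap between consecutive restarts of each broker. The helper name `restart_intervals` is made up for illustration.

```python
from datetime import datetime

# "Rolling pod" lines copied verbatim from the operator log in this report.
LOG_LINES = [
    "2020-11-24 09:24:16 INFO PodOperator:65 - Rolling update of development/kafka-default-development-kafka: Rolling pod kafka-default-development-kafka-0",
    "2020-11-24 09:24:41 INFO PodOperator:65 - Rolling update of development/kafka-default-development-kafka: Rolling pod kafka-default-development-kafka-1",
    "2020-11-24 09:26:16 INFO PodOperator:65 - Rolling update of development/kafka-default-development-kafka: Rolling pod kafka-default-development-kafka-0",
    "2020-11-24 09:26:44 INFO PodOperator:65 - Rolling update of development/kafka-default-development-kafka: Rolling pod kafka-default-development-kafka-1",
    "2020-11-24 09:28:16 INFO PodOperator:65 - Rolling update of development/kafka-default-development-kafka: Rolling pod kafka-default-development-kafka-0",
    "2020-11-24 09:28:46 INFO PodOperator:65 - Rolling update of development/kafka-default-development-kafka: Rolling pod kafka-default-development-kafka-1",
    "2020-11-24 09:30:16 INFO PodOperator:65 - Rolling update of development/kafka-default-development-kafka: Rolling pod kafka-default-development-kafka-0",
]


def restart_intervals(lines, pod):
    """Return the seconds between consecutive 'Rolling pod <pod>' log entries."""
    stamps = [
        # The first 19 characters of each line are the timestamp.
        datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        for line in lines
        if line.endswith("Rolling pod " + pod)
    ]
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(stamps, stamps[1:])]


for pod in ("kafka-default-development-kafka-0", "kafka-default-development-kafka-1"):
    print(pod, restart_intervals(LOG_LINES, pod))
```

For broker 0 the gaps are exactly 120 seconds, which lines up with the timer-triggered reconciliations (#55, #56, #57) in the same log: each periodic reconciliation rolls both brokers again.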
To Reproduce
Steps to reproduce the behavior:
- Install Operator as:
---
apiVersion: helm.fluxcd.io/v1
kind: HelmRelease
metadata:
  name: strimzi
  namespace: kube-system
spec:
  chart:
    repository: https://strimzi.io/charts/
    name: strimzi-kafka-operator
    version: 0.20.0
  releaseName: strimzi
  forceUpgrade: true
  values:
    watchNamespaces:
      - development
      - production
      - staging
      - experimental
- Add Kafka cluster as:
apiVersion: kafka.strimzi.io/v1beta1
kind: Kafka
metadata:
  name: kafka-default-development
  namespace: development
spec:
  kafka:
    config:
      auto.create.topics.enable: "true"
      default.replication.factor: 2
      log.retention.check.interval.ms: 300000
      log.retention.hours: 2
      num.partitions: 20
      offsets.topic.replication.factor: 2
      transaction.state.log.replication.factor: 2
    jvmOptions:
      -Xms: 2048m
      -Xmx: 2048m
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
        configuration:
          useServiceDnsDomain: true
    replicas: 2
    storage:
      class: local
      size: 2Gi
      type: persistent-claim
    version: 2.6.0
  zookeeper:
    config:
      ssl.hostnameVerification: false
      ssl.quorum.hostnameVerification: false
    jvmOptions:
      -Xms: 1024G
      -Xmx: 1024G
    replicas: 3
    storage:
      class: local
      size: 1Gi
      type: persistent-claim
  maintenanceTimeWindows:
    - "* * 0-1 ? * SUN,MON,TUE,WED,THU *" # This was a desperate attempt to switch off the rolling update; it did not work.
Expected behavior
The Kafka operator should not roll the Kafka brokers on every reconciliation.
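One detail in the reproduction config may be worth double-checking, independently of the rolling loop: the ZooKeeper `jvmOptions` use the suffix `G` (`1024G`) while the Kafka ones use `m` (`2048m`). Nothing in this report establishes that this is related to the restarts. The sketch below only converts the two values to bytes so they can be compared; `jvm_size_to_bytes` is a hypothetical helper and handles only the `k`, `m` and `g` suffixes.

```python
import re

# Multipliers for the size suffixes handled here (binary units, case-insensitive).
_UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def jvm_size_to_bytes(value):
    """Convert a heap size such as '2048m' or '1024G' to a number of bytes."""
    match = re.fullmatch(r"(\d+)([kKmMgG]?)", value.strip())
    if match is None:
        raise ValueError(f"not a recognised JVM size: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit.lower()]


# Values taken from the Kafka custom resource in this report.
kafka_heap = jvm_size_to_bytes("2048m")
zookeeper_heap = jvm_size_to_bytes("1024G")

print("kafka heap:    ", kafka_heap // 1024 ** 3, "GiB")
print("zookeeper heap:", zookeeper_heap // 1024 ** 3, "GiB")
```

As written, the ZooKeeper heap setting asks for 1024 GiB next to 2 GiB for Kafka, so `1024m` may have been intended.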
Environment (please complete the following information):
- Strimzi version: 0.20.0
- Installation method: HelmRelease (Helm)
- Kubernetes cluster: Kubernetes 1.17.4
- Infrastructure: Intel bare metal stack
YAML files and logs
The Kafka broker logs all look like this:
2020-11-24 08:51:58,843 INFO [Controller id=0] Starting replica leader election (PREFERRED) for partitions triggered by ZkTriggered (kafka.controller.KafkaController) [controller-event-thread]
2020-11-24 08:51:58,851 INFO [Controller id=0] Starting the controller scheduler (kafka.controller.KafkaController) [controller-event-thread]
2020-11-24 08:51:58,864 INFO [RequestSendThread controllerId=0] Controller 0 connected to kafka-default-development-kafka-0.kafka-default-development-kafka-brokers.development.svc:9091 (id: 0 rack: null) for sending state change requests (kafka.controller.RequestSendThread) [Controller-0-to-broker-0-send-thread]
2020-11-24 08:51:58,871 TRACE [Controller id=0 epoch=31] Received response {error_code=0,_tagged_fields={}} for request UPDATE_METADATA with correlation id 0 sent to broker kafka-default-development-kafka-0.kafka-default-development-kafka-brokers.development.svc:9091 (id: 0 rack: null) (state.change.logger) [Controller-0-to-broker-0-send-thread]
2020-11-24 08:52:03,852 INFO [Controller id=0] Processing automatic preferred replica leader election (kafka.controller.KafkaController) [controller-event-thread]
2020-11-24 08:52:03,853 TRACE [Controller id=0] Checking need to trigger auto leader balancing (kafka.controller.KafkaController) [controller-event-thread]
2020-11-24 08:52:13,889 INFO [Controller id=0] Newly added brokers: 1, deleted brokers: , bounced brokers: , all live brokers: 0,1 (kafka.controller.KafkaController) [controller-event-thread]
2020-11-24 08:52:13,889 DEBUG [Channel manager on controller 0]: Controller 0 trying to connect to broker 1 (kafka.controller.ControllerChannelManager) [controller-event-thread]
2020-11-24 08:52:13,992 INFO [RequestSendThread controllerId=0] Starting (kafka.controller.RequestSendThread) [Controller-0-to-broker-1-send-thread]
2020-11-24 08:52:13,994 INFO [Controller id=0] New broker startup callback for 1 (kafka.controller.KafkaController) [controller-event-thread]
2020-11-24 08:52:13,995 INFO [Controller id=0 epoch=31] Sending UpdateMetadata request to brokers Set(0) for 0 partitions (state.change.logger) [controller-event-thread]
2020-11-24 08:52:13,995 INFO [Controller id=0 epoch=31] Sending UpdateMetadata request to brokers Set(1) for 0 partitions (state.change.logger) [controller-event-thread]
2020-11-24 08:52:13,998 TRACE [Controller id=0 epoch=31] Received response {error_code=0,_tagged_fields={}} for request UPDATE_METADATA with correlation id 1 sent to broker kafka-default-development-kafka-0.kafka-default-development-kafka-brokers.development.svc:9091 (id: 0 rack: null) (state.change.logger) [Controller-0-to-broker-0-send-thread]
2020-11-24 08:52:13,998 DEBUG [Controller id=0] Register BrokerModifications handler for Vector(1) (kafka.controller.KafkaController) [controller-event-thread]
2020-11-24 08:52:14,000 INFO [Controller id=0] Updated broker epochs cache: Map(1 -> 4294967939, 0 -> 4294967920) (kafka.controller.KafkaController) [controller-event-thread]
2020-11-24 08:52:14,236 INFO [RequestSendThread controllerId=0] Controller 0 connected to kafka-default-development-kafka-1.kafka-default-development-kafka-brokers.development.svc:9091 (id: 1 rack: null) for sending state change requests (kafka.controller.RequestSendThread) [Controller-0-to-broker-1-send-thread]
2020-11-24 08:52:14,301 TRACE [Controller id=0 epoch=31] Received response {error_code=0,_tagged_fields={}} for request UPDATE_METADATA with correlation id 0 sent to broker kafka-default-development-kafka-1.kafka-default-development-kafka-brokers.development.svc:9091 (id: 1 rack: null) (state.change.logger) [Controller-0-to-broker-1-send-thread]
After this last TRACE log, brokers receive a SIGTERM.
The ZooKeeper nodes look healthy:
2020-11-24 09:35:57,073 INFO Processing ruok command from /127.0.0.1:37096 (org.apache.zookeeper.server.NettyServerCnxn) [nioEventLoopGroup-4-2]
2020-11-24 09:36:06,323 INFO Processing ruok command from /127.0.0.1:37706 (org.apache.zookeeper.server.NettyServerCnxn) [nioEventLoopGroup-4-1]
2020-11-24 09:36:07,066 INFO Processing ruok command from /127.0.0.1:37758 (org.apache.zookeeper.server.NettyServerCnxn) [nioEventLoopGroup-4-2]
2020-11-24 09:36:14,874 INFO Authenticated Id 'CN=cluster-operator,O=io.strimzi' for Scheme 'x509' (org.apache.zookeeper.server.auth.X509AuthenticationProvider) [nioEventLoopGroup-7-1]
Issue Analytics
- State:
- Created 3 years ago
- Comments:13 (8 by maintainers)
Top GitHub Comments
@zzvara I opened a separate issue, #4031, for the Helm Chart since it would be easy for it to get lost in this one. Do you think we can close this one now, or do you have something else? Thanks
I think you are able to override some of them, but not set arbitrary ones. I’m not sure I will get to it, and I do not know Helm well enough to allow setting any env vars. But contributions are always welcome 😉.