[Bug] Kafka brokers stuck in a Rolling update loop

Describe the bug

From the operator logs we see:

2020-11-24 09:24:16 INFO  PodOperator:65 - Rolling update of development/kafka-default-development-kafka: Rolling pod kafka-default-development-kafka-0
2020-11-24 09:24:41 INFO  PodOperator:65 - Rolling update of development/kafka-default-development-kafka: Rolling pod kafka-default-development-kafka-1
2020-11-24 09:25:11 INFO  AbstractOperator:455 - Reconciliation #54(timer) Kafka(development/kafka-default-development): reconciled
2020-11-24 09:26:13 INFO  ClusterOperator:125 - Triggering periodic reconciliation for namespace experimental...
2020-11-24 09:26:13 INFO  ClusterOperator:125 - Triggering periodic reconciliation for namespace production...
2020-11-24 09:26:14 INFO  ClusterOperator:125 - Triggering periodic reconciliation for namespace kube-system...
2020-11-24 09:26:14 INFO  ClusterOperator:125 - Triggering periodic reconciliation for namespace staging...
2020-11-24 09:26:14 INFO  ClusterOperator:125 - Triggering periodic reconciliation for namespace development...
2020-11-24 09:26:14 INFO  AbstractOperator:217 - Reconciliation #55(timer) Kafka(development/kafka-default-development): Kafka kafka-default-development will be checked for creation or modification
2020-11-24 09:26:16 INFO  PodOperator:65 - Rolling update of development/kafka-default-development-kafka: Rolling pod kafka-default-development-kafka-0
2020-11-24 09:26:44 INFO  PodOperator:65 - Rolling update of development/kafka-default-development-kafka: Rolling pod kafka-default-development-kafka-1
2020-11-24 09:27:15 INFO  AbstractOperator:455 - Reconciliation #55(timer) Kafka(development/kafka-default-development): reconciled
2020-11-24 09:28:13 INFO  ClusterOperator:125 - Triggering periodic reconciliation for namespace experimental...
2020-11-24 09:28:13 INFO  ClusterOperator:125 - Triggering periodic reconciliation for namespace production...
2020-11-24 09:28:14 INFO  ClusterOperator:125 - Triggering periodic reconciliation for namespace kube-system...
2020-11-24 09:28:14 INFO  ClusterOperator:125 - Triggering periodic reconciliation for namespace staging...
2020-11-24 09:28:14 INFO  ClusterOperator:125 - Triggering periodic reconciliation for namespace development...
2020-11-24 09:28:14 INFO  AbstractOperator:217 - Reconciliation #56(timer) Kafka(development/kafka-default-development): Kafka kafka-default-development will be checked for creation or modification
2020-11-24 09:28:16 INFO  PodOperator:65 - Rolling update of development/kafka-default-development-kafka: Rolling pod kafka-default-development-kafka-0
2020-11-24 09:28:46 INFO  PodOperator:65 - Rolling update of development/kafka-default-development-kafka: Rolling pod kafka-default-development-kafka-1
2020-11-24 09:29:18 INFO  AbstractOperator:455 - Reconciliation #56(timer) Kafka(development/kafka-default-development): reconciled
2020-11-24 09:30:13 INFO  ClusterOperator:125 - Triggering periodic reconciliation for namespace experimental...
2020-11-24 09:30:13 INFO  ClusterOperator:125 - Triggering periodic reconciliation for namespace production...
2020-11-24 09:30:14 INFO  ClusterOperator:125 - Triggering periodic reconciliation for namespace kube-system...
2020-11-24 09:30:14 INFO  ClusterOperator:125 - Triggering periodic reconciliation for namespace staging...
2020-11-24 09:30:14 INFO  ClusterOperator:125 - Triggering periodic reconciliation for namespace development...
2020-11-24 09:30:14 INFO  AbstractOperator:217 - Reconciliation #57(timer) Kafka(development/kafka-default-development): Kafka kafka-default-development will be checked for creation or modification
2020-11-24 09:30:16 INFO  PodOperator:65 - Rolling update of development/kafka-default-development-kafka: Rolling pod kafka-default-development-kafka-0

Kafka brokers get restarted every ~2 minutes.
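
The restart cadence can be checked directly from the operator log. The following is a minimal Python sketch, not part of the original report: it assumes the log line format shown above, and the names `ROLL_RE` and `roll_intervals` are made up for illustration. It groups the "Rolling pod" lines by pod and reports the seconds between consecutive rolls.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumed line format, copied from the operator log excerpt above:
# 2020-11-24 09:24:16 INFO  PodOperator:65 - Rolling update of <ns>/<name>: Rolling pod <pod>
ROLL_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+INFO\s+PodOperator:\d+ - "
    r"Rolling update of \S+: Rolling pod (?P<pod>\S+)$"
)


def roll_intervals(log_text):
    """Return {pod: [seconds between consecutive rolls]} for the given log text."""
    rolls = defaultdict(list)
    for line in log_text.splitlines():
        match = ROLL_RE.match(line.strip())
        if match:
            timestamp = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
            rolls[match.group("pod")].append(timestamp)
    # Pair each roll with the next one for the same pod and take the difference.
    return {
        pod: [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]
        for pod, times in rolls.items()
    }


SAMPLE = """\
2020-11-24 09:24:16 INFO  PodOperator:65 - Rolling update of development/kafka-default-development-kafka: Rolling pod kafka-default-development-kafka-0
2020-11-24 09:24:41 INFO  PodOperator:65 - Rolling update of development/kafka-default-development-kafka: Rolling pod kafka-default-development-kafka-1
2020-11-24 09:25:11 INFO  AbstractOperator:455 - Reconciliation #54(timer) Kafka(development/kafka-default-development): reconciled
2020-11-24 09:26:16 INFO  PodOperator:65 - Rolling update of development/kafka-default-development-kafka: Rolling pod kafka-default-development-kafka-0
2020-11-24 09:26:44 INFO  PodOperator:65 - Rolling update of development/kafka-default-development-kafka: Rolling pod kafka-default-development-kafka-1
2020-11-24 09:28:16 INFO  PodOperator:65 - Rolling update of development/kafka-default-development-kafka: Rolling pod kafka-default-development-kafka-0
"""

if __name__ == "__main__":
    for pod, gaps in roll_intervals(SAMPLE).items():
        print(pod, gaps)
```

On the excerpt above, each broker is rolled once per timer-triggered reconciliation, which matches the 2-minute spacing of the "Triggering periodic reconciliation" lines.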

To Reproduce

Steps to reproduce the behavior:

  1. Install Operator as:
---
apiVersion: helm.fluxcd.io/v1
kind: HelmRelease
metadata:
  name: strimzi
  namespace: kube-system
spec:
  chart:
    repository: https://strimzi.io/charts/
    name: strimzi-kafka-operator
    version: 0.20.0
  releaseName: strimzi
  forceUpgrade: true
  values:
    watchNamespaces:
      - development
      - production
      - staging
      - experimental
  2. Add Kafka cluster as:
apiVersion: kafka.strimzi.io/v1beta1
kind: Kafka
metadata:
  name: kafka-default-development
  namespace: development
spec:
  kafka:
    config:
      auto.create.topics.enable: "true"
      default.replication.factor: 2
      log.retention.check.interval.ms: 300000
      log.retention.hours: 2
      num.partitions: 20
      offsets.topic.replication.factor: 2
      transaction.state.log.replication.factor: 2
    jvmOptions:
      -Xms: 2048m
      -Xmx: 2048m
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
        configuration:
          useServiceDnsDomain: true
    replicas: 2
    storage:
      class: local
      size: 2Gi
      type: persistent-claim
    version: 2.6.0
  zookeeper:
    config:
      ssl.hostnameVerification: false
      ssl.quorum.hostnameVerification: false
    jvmOptions:
      -Xms: 1024G
      -Xmx: 1024G
    replicas: 3
    storage:
      class: local
      size: 1Gi
      type: persistent-claim
  maintenanceTimeWindows:
    - "* * 0-1 ? * SUN,MON,TUE,WED,THU *" # This is a desperate attempt to switch off the rolling update; it did not work.
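
For readers unfamiliar with the `maintenanceTimeWindows` entry in the manifest above, here is a small Python sketch that labels the fields of the expression. It assumes a Quartz-style field order (seconds, minutes, hours, day of month, month, day of week, optional year); that order is an assumption here and should be checked against the Strimzi documentation for your version. The helper name `label_cron_fields` is hypothetical.

```python
# Assumed Quartz-style field order; not confirmed by the original report.
QUARTZ_FIELDS = (
    "seconds",
    "minutes",
    "hours",
    "day_of_month",
    "month",
    "day_of_week",
    "year",  # optional seventh field
)


def label_cron_fields(expression):
    """Split a cron expression on whitespace and pair each part with a field name."""
    parts = expression.split()
    if not 6 <= len(parts) <= len(QUARTZ_FIELDS):
        raise ValueError(f"expected 6 or 7 fields, got {len(parts)}")
    return dict(zip(QUARTZ_FIELDS, parts))


if __name__ == "__main__":
    window = "* * 0-1 ? * SUN,MON,TUE,WED,THU *"
    for name, value in label_cron_fields(window).items():
        print(f"{name:>12}: {value}")
```

Read this way, the window in the report covers hours 0 to 1 on Sunday through Thursday. As the reporter notes, setting it did not stop the rolling updates.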

Expected behavior

The Kafka operator should not repeatedly perform rolling updates of the Kafka brokers.

Environment (please complete the following information):

  • Strimzi version: 0.20.0
  • Installation method: HelmRelease (Helm)
  • Kubernetes cluster: Kubernetes 1.17.4
  • Infrastructure: Intel bare metal stack

YAML files and logs

The Kafka broker logs all look like this:

2020-11-24 08:51:58,843 INFO [Controller id=0] Starting replica leader election (PREFERRED) for partitions  triggered by ZkTriggered (kafka.controller.KafkaController) [controller-event-thread]
2020-11-24 08:51:58,851 INFO [Controller id=0] Starting the controller scheduler (kafka.controller.KafkaController) [controller-event-thread]
2020-11-24 08:51:58,864 INFO [RequestSendThread controllerId=0] Controller 0 connected to kafka-default-development-kafka-0.kafka-default-development-kafka-brokers.development.svc:9091 (id: 0 rack: null) for sending state change requests (kafka.controller.RequestSendThread) [Controller-0-to-broker-0-send-thread]
2020-11-24 08:51:58,871 TRACE [Controller id=0 epoch=31] Received response {error_code=0,_tagged_fields={}} for request UPDATE_METADATA with correlation id 0 sent to broker kafka-default-development-kafka-0.kafka-default-development-kafka-brokers.development.svc:9091 (id: 0 rack: null) (state.change.logger) [Controller-0-to-broker-0-send-thread]
2020-11-24 08:52:03,852 INFO [Controller id=0] Processing automatic preferred replica leader election (kafka.controller.KafkaController) [controller-event-thread]
2020-11-24 08:52:03,853 TRACE [Controller id=0] Checking need to trigger auto leader balancing (kafka.controller.KafkaController) [controller-event-thread]
2020-11-24 08:52:13,889 INFO [Controller id=0] Newly added brokers: 1, deleted brokers: , bounced brokers: , all live brokers: 0,1 (kafka.controller.KafkaController) [controller-event-thread]
2020-11-24 08:52:13,889 DEBUG [Channel manager on controller 0]: Controller 0 trying to connect to broker 1 (kafka.controller.ControllerChannelManager) [controller-event-thread]
2020-11-24 08:52:13,992 INFO [RequestSendThread controllerId=0] Starting (kafka.controller.RequestSendThread) [Controller-0-to-broker-1-send-thread]
2020-11-24 08:52:13,994 INFO [Controller id=0] New broker startup callback for 1 (kafka.controller.KafkaController) [controller-event-thread]
2020-11-24 08:52:13,995 INFO [Controller id=0 epoch=31] Sending UpdateMetadata request to brokers Set(0) for 0 partitions (state.change.logger) [controller-event-thread]
2020-11-24 08:52:13,995 INFO [Controller id=0 epoch=31] Sending UpdateMetadata request to brokers Set(1) for 0 partitions (state.change.logger) [controller-event-thread]
2020-11-24 08:52:13,998 TRACE [Controller id=0 epoch=31] Received response {error_code=0,_tagged_fields={}} for request UPDATE_METADATA with correlation id 1 sent to broker kafka-default-development-kafka-0.kafka-default-development-kafka-brokers.development.svc:9091 (id: 0 rack: null) (state.change.logger) [Controller-0-to-broker-0-send-thread]
2020-11-24 08:52:13,998 DEBUG [Controller id=0] Register BrokerModifications handler for Vector(1) (kafka.controller.KafkaController) [controller-event-thread]
2020-11-24 08:52:14,000 INFO [Controller id=0] Updated broker epochs cache: Map(1 -> 4294967939, 0 -> 4294967920) (kafka.controller.KafkaController) [controller-event-thread]
2020-11-24 08:52:14,236 INFO [RequestSendThread controllerId=0] Controller 0 connected to kafka-default-development-kafka-1.kafka-default-development-kafka-brokers.development.svc:9091 (id: 1 rack: null) for sending state change requests (kafka.controller.RequestSendThread) [Controller-0-to-broker-1-send-thread]
2020-11-24 08:52:14,301 TRACE [Controller id=0 epoch=31] Received response {error_code=0,_tagged_fields={}} for request UPDATE_METADATA with correlation id 0 sent to broker kafka-default-development-kafka-1.kafka-default-development-kafka-brokers.development.svc:9091 (id: 1 rack: null) (state.change.logger) [Controller-0-to-broker-1-send-thread]

After this last TRACE log, brokers receive a SIGTERM.

The ZooKeeper logs look healthy:

2020-11-24 09:35:57,073 INFO Processing ruok command from /127.0.0.1:37096 (org.apache.zookeeper.server.NettyServerCnxn) [nioEventLoopGroup-4-2]
2020-11-24 09:36:06,323 INFO Processing ruok command from /127.0.0.1:37706 (org.apache.zookeeper.server.NettyServerCnxn) [nioEventLoopGroup-4-1]
2020-11-24 09:36:07,066 INFO Processing ruok command from /127.0.0.1:37758 (org.apache.zookeeper.server.NettyServerCnxn) [nioEventLoopGroup-4-2]
2020-11-24 09:36:14,874 INFO Authenticated Id 'CN=cluster-operator,O=io.strimzi' for Scheme 'x509' (org.apache.zookeeper.server.auth.X509AuthenticationProvider) [nioEventLoopGroup-7-1]

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 13 (8 by maintainers)

Top GitHub Comments

1 reaction
scholzj commented, Dec 1, 2020

@zzvara I opened a separate issue #4031 for the Helm Chart, since it would be easy to get lost in this one. Do you think we can close this one now? Or do you have something else? Thanks

0 reactions
scholzj commented, Nov 26, 2020

It would be nice to be able to overwrite ENV variables from the Helm chart values.

I think you are able to override some of them, but not set arbitrary ones. I’m not sure I will get to it, and I do not know Helm well enough to allow setting any env vars. But contributions are always welcome 😉.
