
Unable to upgrade zookeeper

See original GitHub issue

Please use this only for bug reports. For questions or when you need help, use GitHub Discussions, our #strimzi Slack channel, or our user mailing list.

Describe the bug
We are trying to upgrade Strimzi from 0.24.0 to 0.26.0, but it does not get past the upgrade of ZooKeeper. ZooKeeper enters a crash loop with a message that it is unable to connect to the other ZooKeeper instances. When I increase the failure threshold on the livenessProbe (so ZooKeeper stays online a bit longer), I get: Refusing session request for client /192.168.192.2:45912 as it has seen zxid 0x900000355 our last zxid is 0x500000002 client must try another server
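For reference, Strimzi's `Kafka` custom resource lets you tune the ZooKeeper liveness probe directly, which is a cleaner way to extend the failure timeout than patching the pod. A minimal sketch (field names follow Strimzi's `Probe` schema; the values are illustrative, not recommendations):

```yaml
spec:
  zookeeper:
    livenessProbe:
      initialDelaySeconds: 30
      timeoutSeconds: 10
      periodSeconds: 10
      failureThreshold: 10  # keep the pod alive through more failed checks
```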

To Reproduce
Steps to reproduce the behavior:

  1. Install Strimzi 0.24.0 with Kafka version 2.8.0
  2. Upgrade the operator
  3. Change the Kafka version to 3.0.0
  4. See error
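The steps above correspond to edits in the `Kafka` custom resource. Per Strimzi's documented upgrade flow, `spec.kafka.version` is bumped first while `inter.broker.protocol.version` (and `log.message.format.version`) stay at the old value until all brokers have rolled. A sketch of the relevant fragment:

```yaml
spec:
  kafka:
    version: 3.0.0  # step 3: new Kafka version
    config:
      # left at the old versions until the rolling update completes
      inter.broker.protocol.version: "2.8"
      log.message.format.version: "2.8"
```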

Expected behavior
I expect Strimzi to update ZooKeeper in a way that does not break it.

Environment (please complete the following information):

  • Strimzi version: 0.26.0 (upgrading from 0.24.0)
  • Installation method: [e.g. YAML files]
  • Kubernetes cluster: [e.g. Kubernetes 1.22.3, virtual machines on azure cloud]
  • Infrastructure: Azure VirtualMachines

YAML files and logs
zokeeper-healthy-probe-10.log zookeeper-healthy-probe-3.txt

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: salvador
  namespace: default
spec:
  kafka:
    version: 3.0.0
    replicas: 3
    listeners:
      - name: tls
        port: 9094
        type: internal
        tls: true
        configuration:
          brokerCertChainAndKey:
            secretName: k8s-cert
            certificate: tls.crt
            key: tls.key
        authentication:
          type: scram-sha-512
    authorization:
      type: simple
      superUsers:
        - salvador-clusteradmin
    config:
      log.message.format.version: "2.8"
      inter.broker.protocol.version: "2.8"
      offsets.topic.replication.factor: 1
      transaction.state.log.replication.factor: 1
      transaction.state.log.min.isr: 1
      group.max.session.timeout.ms: 1800000
      group.min.session.timeout.ms: 6000
    storage:
      deleteClaim: false
      size: 256Gi
      type: persistent-claim
      selector:
        usage: kafka-storage
    template:
      pod:
        metadata:
          annotations:
            prometheus.io/port: 9404
            prometheus.io/scrape: true
          labels:
            elasticsearch_index: kafka_applicationlogs
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          name: kafka-metrics
          key: kafka-metrics-config.yml
    logging:
      type: inline
      loggers:
        kafka.root.logger.level: INFO
        log4j.appender.CONSOLE: org.apache.log4j.ConsoleAppender
        log4j.appender.CONSOLE.layout: net.logstash.log4j.JSONEventLayoutV1
  zookeeper:
    replicas: 3
    config:
      autopurge.snapRetainCount: 3
      # Interval in hours
      autopurge.purgeInterval: 1
    storage:
      deleteClaim: false
      size: 10Gi
      type: persistent-claim
      selector:
        usage: zookeeper-storage
    template:
      pod:
        metadata:
          annotations:
            prometheus.io/port: 9404
            prometheus.io/scrape: true
          labels:
            elasticsearch_index: zookeeper_applicationlogs
    logging:
      type: inline
      loggers:
        zookeeper.root.logger.level: INFO
        log4j.appender.CONSOLE: org.apache.log4j.ConsoleAppender
        log4j.appender.CONSOLE.layout: net.logstash.log4j.JSONEventLayoutV1
    resources:
      requests:
        memory: "250Mi"
        cpu: "100m"
      limits:
        memory: "500Mi"
        cpu: "300m"
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          name: kafka-metrics
          key: zookeeper-metrics-config.yml
  kafkaExporter:
    template:
      pod:
        metadata:
          annotations:
            prometheus.io/port: 9404
            prometheus.io/scrape: true
  entityOperator:
    topicOperator: {}
    userOperator: {}

I have tried resetting the ZooKeeper volumes, restarting them all at once, and deploying the new version on a fresh cluster without the upgrade, but I keep running into the same issue where ZooKeeper does not start. Using the image: kafka:0.26.0-kafka-3.0.0. Do I need to wait for a new Strimzi version, or should I be able to upgrade the Strimzi operator to 0.26?

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

1 reaction
CodeGlitcher commented, Nov 26, 2021

I saw some CPU throttling and increased the CPU limit from 300m to 400m, and now all ZooKeeper instances are running. A bit strange that some throttling crashed the pod, but I can continue my upgrade now. So for me the issue is resolved.
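In terms of the `Kafka` resource above, the fix amounts to raising the ZooKeeper CPU limit (note the unit is `m`, millicores, not `mi`). A sketch of the changed fragment:

```yaml
spec:
  zookeeper:
    resources:
      requests:
        memory: "250Mi"
        cpu: "100m"
      limits:
        memory: "500Mi"
        cpu: "400m"  # raised from 300m to avoid throttling during startup and leader election
```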

0 reactions
scholzj commented, Nov 26, 2021

That is weird; the zxid error in particular seems odd in combination with CPU throttling. But glad you managed to solve it.


Top Results From Across the Web

Zookeeper upgrade fails due to missing snapshots - Apache
Zookeeper upgrade fails due to missing snapshots. Status: Resolved · Assignee: Unassigned · Priority: Major · Resolution: Duplicate.
Upgrade ZooKeeper - Hortonworks Data Platform
If the upgrade is unsuccessful or validations fail, follow the ZooKeeper downgrade steps in Downgrading the Cluster.
Tableau server upgrade failed because Zookeeper is not ...
We tried to upgrade from version 2022-1-1 to 2022-1-2. But, it failed because Zookeeper is not getting started.
How to deal with missing snapshot after ZooKeeper upgrade
The solution is to alter the configuration and append snapshot.trust.empty=true option to skip this check. $ sudo -u kafka cat /opt/kafka/kafka/ ...
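If this ZooKeeper option were needed on a Strimzi-managed cluster, it would presumably go into the `Kafka` resource rather than a local `zoo.cfg`. A sketch, assuming Strimzi passes the option through `spec.zookeeper.config` (not verified against Strimzi's list of forbidden options):

```yaml
spec:
  zookeeper:
    config:
      snapshot.trust.empty: true  # assumption: forwarded to zoo.cfg by the operator
```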
Unable to Start Zookeeper | Apigee Edge
Typically, ZooKeeper election failure is caused by a misconfigured myid. Use the resolution in Misconfigured ZooKeeper myid to address the election failure. If ......