
Unable to upgrade zookeeper

See original GitHub issue

Please use this only for bug reports. For questions or when you need help, use GitHub Discussions, our #strimzi Slack channel, or our user mailing list.

Describe the bug
We are trying to upgrade Strimzi from 0.24.0 to 0.26.0, but it does not get past the upgrade of ZooKeeper. ZooKeeper enters a crash loop with a message that it is unable to connect to the other ZooKeeper instances. When I increase the failure threshold on the livenessProbe (so ZooKeeper stays online a bit longer), I get: Refusing session request for client /192.168.192.2:45912 as it has seen zxid 0x900000355 our last zxid is 0x500000002 client must try another server
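For reference, Strimzi's `Kafka` custom resource lets you tune the ZooKeeper liveness probe directly, which is a cleaner way to extend the failure timeout than patching the pod. A minimal sketch (field names follow Strimzi's `Probe` schema; the values are illustrative, not recommendations):

```yaml
spec:
  zookeeper:
    livenessProbe:
      initialDelaySeconds: 30
      timeoutSeconds: 10
      periodSeconds: 10
      failureThreshold: 10  # keep the pod alive through more failed checks
```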

To Reproduce
Steps to reproduce the behavior:

  1. Install Strimzi 0.24.0 with Kafka version 2.8.0
  2. Upgrade the operator
  3. Change the Kafka version to 3.0.0
  4. See error
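The steps above correspond to edits in the `Kafka` custom resource. Per Strimzi's documented upgrade flow, `spec.kafka.version` is bumped first while `inter.broker.protocol.version` (and `log.message.format.version`) stay at the old value until all brokers have rolled. A sketch of the relevant fragment:

```yaml
spec:
  kafka:
    version: 3.0.0  # step 3: new Kafka version
    config:
      # left at the old versions until the rolling update completes
      inter.broker.protocol.version: "2.8"
      log.message.format.version: "2.8"
```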

Expected behavior
I expect Strimzi to update ZooKeeper in a way that does not break it.

Environment (please complete the following information):

  • Strimzi version: 0.26.0 (upgrading from 0.24.0)
  • Installation method: [e.g. YAML files]
  • Kubernetes cluster: [e.g. Kubernetes 1.22.3, virtual machines on azure cloud]
  • Infrastructure: Azure VirtualMachines

YAML files and logs
zokeeper-healthy-probe-10.log zookeeper-healthy-probe-3.txt

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: salvador
  namespace: default
spec:
  kafka:
    version: 3.0.0
    replicas: 3
    listeners:
      - name: tls
        port: 9094
        type: internal
        tls: true
        configuration:
          brokerCertChainAndKey:
            secretName: k8s-cert
            certificate: tls.crt
            key: tls.key
        authentication:
          type: scram-sha-512
    authorization:
      type: simple
      superUsers:
        - salvador-clusteradmin
    config:
      log.message.format.version: "2.8"
      inter.broker.protocol.version: "2.8"
      offsets.topic.replication.factor: 1
      transaction.state.log.replication.factor: 1
      transaction.state.log.min.isr: 1
      group.max.session.timeout.ms: 1800000
      group.min.session.timeout.ms: 6000
    storage:
      deleteClaim: false
      size: 256Gi
      type: persistent-claim
      selector:
        usage: kafka-storage
    template:
      pod:
        metadata:
          annotations:
            prometheus.io/port: 9404
            prometheus.io/scrape: true
          labels:
            elasticsearch_index: kafka_applicationlogs
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          name: kafka-metrics
          key: kafka-metrics-config.yml
    logging:
      type: inline
      loggers:
        kafka.root.logger.level: INFO
        log4j.appender.CONSOLE: org.apache.log4j.ConsoleAppender
        log4j.appender.CONSOLE.layout: net.logstash.log4j.JSONEventLayoutV1
  zookeeper:
    replicas: 3
    config:
      autopurge.snapRetainCount: 3
      # Interval in hours
      autopurge.purgeInterval: 1
    storage:
      deleteClaim: false
      size: 10Gi
      type: persistent-claim
      selector:
        usage: zookeeper-storage
    template:
      pod:
        metadata:
          annotations:
            prometheus.io/port: 9404
            prometheus.io/scrape: true
          labels:
            elasticsearch_index: zookeeper_applicationlogs
    logging:
      type: inline
      loggers:
        zookeeper.root.logger.level: INFO
        log4j.appender.CONSOLE: org.apache.log4j.ConsoleAppender
        log4j.appender.CONSOLE.layout: net.logstash.log4j.JSONEventLayoutV1
    resources:
      requests:
        memory: "250Mi"
        cpu: "100m"
      limits:
        memory: "500Mi"
        cpu: "300m"
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          name: kafka-metrics
          key: zookeeper-metrics-config.yml
  kafkaExporter:
    template:
      pod:
        metadata:
          annotations:
            prometheus.io/port: 9404
            prometheus.io/scrape: true
  entityOperator:
    topicOperator: {}
    userOperator: {}

I have tried resetting the ZooKeeper volumes, restarting them all at once, and deploying the new version on a fresh cluster without the upgrade, but I keep running into the same issue where ZooKeeper does not start. Using the image: kafka:0.26.0-kafka-3.0.0. Do I need to wait for a new Strimzi version, or should I be able to upgrade the Strimzi operator to 0.26?

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

1 reaction
CodeGlitcher commented, Nov 26, 2021

I saw some CPU throttling and increased the CPU limit from 300m to 400m, and now all ZooKeeper instances are running. A bit strange that some throttling crashed the pod, but I can continue my upgrade now. So for me the issue is resolved.
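In terms of the `Kafka` resource above, the fix amounts to raising the ZooKeeper CPU limit (note the unit is `m`, millicores, not `mi`). A sketch of the changed fragment:

```yaml
spec:
  zookeeper:
    resources:
      requests:
        memory: "250Mi"
        cpu: "100m"
      limits:
        memory: "500Mi"
        cpu: "400m"  # raised from 300m to avoid throttling during startup and leader election
```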

0 reactions
scholzj commented, Nov 26, 2021

That is weird; the zxid error in particular seems odd in combination with CPU throttling. But glad you managed to solve it.


Top Results From Across the Web

Zookeeper upgrade fails due to missing snapshots - Apache
Zookeeper upgrade fails due to missing snapshots. Status: Resolved · Assignee: Unassigned · Priority: Major · Resolution: Duplicate.
Upgrade ZooKeeper - Hortonworks Data Platform
If the upgrade is unsuccessful or validations fail, follow the ZooKeeper downgrade steps in Downgrading the Cluster.
Tableau server upgrade failed because Zookeeper is not ...
We tried to upgrade from version 2022-1-1 to 2022-1-2. But, it failed because Zookeeper is not getting started.
How to deal with missing snapshot after ZooKeeper upgrade
The solution is to alter the configuration and append snapshot.trust.empty=true option to skip this check. $ sudo -u kafka cat /opt/kafka/kafka/ ...
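If this ZooKeeper option were needed on a Strimzi-managed cluster, it would presumably go into the `Kafka` resource rather than a local `zoo.cfg`. A sketch, assuming Strimzi passes the option through `spec.zookeeper.config` (not verified against Strimzi's list of forbidden options):

```yaml
spec:
  zookeeper:
    config:
      snapshot.trust.empty: true  # assumption: forwarded to zoo.cfg by the operator
```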
Unable to Start Zookeeper | Apigee Edge
Typically, ZooKeeper election failure is caused by a misconfigured myid. Use the resolution in Misconfigured ZooKeeper myid to address the election failure. If ......