Unable to upgrade zookeeper
Describe the bug
We are trying to upgrade Strimzi from 0.24.0 to 0.26.0,
but it does not get past the upgrade of ZooKeeper.
ZooKeeper enters a crash loop with a message that it is unable to connect to the other ZooKeeper instances.
When I increase the failure timeout on the livenessProbe (so ZooKeeper will stay online a bit longer), I get:
Refusing session request for client /192.168.192.2:45912 as it has seen zxid 0x900000355 our last zxid is 0x500000002 client must try another server
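For reference, the probe tweak goes through the livenessProbe override in the Kafka CR; a minimal sketch with illustrative values (assuming failureThreshold is the field being raised):
spec:
  zookeeper:
    livenessProbe:
      initialDelaySeconds: 15
      timeoutSeconds: 5
      failureThreshold: 10   # illustrative value, raised so the pod is not killed as quickly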
To Reproduce
Steps to reproduce the behavior:
- Install Strimzi 0.24 with Kafka version 2.8.0
- Upgrade the operator to 0.26.0
- Change the Kafka version to 3.0.0 (see the sketch below)
- See error
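For clarity, the version change in the third step is just an edit to the Kafka CR, roughly like this (a sketch; per the Strimzi upgrade procedure, the protocol and message-format versions stay on 2.8 until all brokers are running 3.0.0):
spec:
  kafka:
    version: 3.0.0
    config:
      inter.broker.protocol.version: "2.8"   # bump to "3.0" only after all brokers have rolled to 3.0.0
      log.message.format.version: "2.8"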
Expected behavior
I expect Strimzi to upgrade ZooKeeper in a way that does not break it.
Environment (please complete the following information):
- Strimzi version: 0.26.0
- Installation method: YAML files
- Kubernetes cluster: Kubernetes 1.22.3, virtual machines on Azure cloud
- Infrastructure: Azure VirtualMachines
YAML files and logs
zokeeper-healthy-probe-10.log
zookeeper-healthy-probe-3.txt
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: salvador
  namespace: default
spec:
  kafka:
    version: 3.0.0
    replicas: 3
    listeners:
      - name: tls
        port: 9094
        type: internal
        tls: true
        configuration:
          brokerCertChainAndKey:
            secretName: k8s-cert
            certificate: tls.crt
            key: tls.key
        authentication:
          type: scram-sha-512
    authorization:
      type: simple
      superUsers:
        - salvador-clusteradmin
    config:
      log.message.format.version: "2.8"
      inter.broker.protocol.version: "2.8"
      offsets.topic.replication.factor: 1
      transaction.state.log.replication.factor: 1
      transaction.state.log.min.isr: 1
      group.max.session.timeout.ms: 1800000
      group.min.session.timeout.ms: 6000
    storage:
      deleteClaim: false
      size: 256Gi
      type: persistent-claim
      selector:
        usage: kafka-storage
    template:
      pod:
        metadata:
          annotations:
            prometheus.io/port: 9404
            prometheus.io/scrape: true
          labels:
            elasticsearch_index: kafka_applicationlogs
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          name: kafka-metrics
          key: kafka-metrics-config.yml
    logging:
      type: inline
      loggers:
        kafka.root.logger.level: INFO
        log4j.appender.CONSOLE: org.apache.log4j.ConsoleAppender
        log4j.appender.CONSOLE.layout: net.logstash.log4j.JSONEventLayoutV1
  zookeeper:
    replicas: 3
    config:
      autopurge.snapRetainCount: 3
      # Interval in hours
      autopurge.purgeInterval: 1
    storage:
      deleteClaim: false
      size: 10Gi
      type: persistent-claim
      selector:
        usage: zookeeper-storage
    template:
      pod:
        metadata:
          annotations:
            prometheus.io/port: 9404
            prometheus.io/scrape: true
          labels:
            elasticsearch_index: zookeeper_applicationlogs
    logging:
      type: inline
      loggers:
        zookeeper.root.logger.level: INFO
        log4j.appender.CONSOLE: org.apache.log4j.ConsoleAppender
        log4j.appender.CONSOLE.layout: net.logstash.log4j.JSONEventLayoutV1
    resources:
      requests:
        memory: "250Mi"
        cpu: "100m"
      limits:
        memory: "500Mi"
        cpu: "300m"
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          name: kafka-metrics
          key: zookeeper-metrics-config.yml
  kafkaExporter:
    template:
      pod:
        metadata:
          annotations:
            prometheus.io/port: 9404
            prometheus.io/scrape: true
  entityOperator:
    topicOperator: {}
    userOperator: {}
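Note that both metricsConfig blocks reference a kafka-metrics ConfigMap that is not included above; a minimal skeleton of what it has to provide (the keys match the CR, while the actual JMX exporter rules would come from the Strimzi metrics examples, so the empty rules lists here are placeholders):
apiVersion: v1
kind: ConfigMap
metadata:
  name: kafka-metrics
  namespace: default
data:
  kafka-metrics-config.yml: |
    lowercaseOutputName: true
    rules: []   # placeholder; real rules taken from the Strimzi metrics examples
  zookeeper-metrics-config.yml: |
    lowercaseOutputName: true
    rules: []   # placeholder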
I have tried to reset the ZooKeeper volumes, restarted them all at once, and tried to deploy the new version on a fresh cluster without the upgrade, but I keep running into the same issue where ZooKeeper does not start. Using the image: kafka:0.26.0-kafka-3.0.0. Do I need to wait for a new Strimzi version, or should I be able to upgrade the Strimzi operator to 0.26?
Top GitHub Comments
I saw some CPU throttling and increased the CPU limit from 300m to 400m, and now all ZooKeeper instances are running. It is a bit strange that some throttling crashed the pod, but I can continue my upgrade now. So for me the issue is resolved.
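For completeness, the fix amounts to bumping the ZooKeeper CPU limit in the Kafka CR, e.g. (a sketch based on the resources block above, with only the limit changed):
spec:
  zookeeper:
    resources:
      requests:
        memory: "250Mi"
        cpu: "100m"
      limits:
        memory: "500Mi"
        cpu: "400m"   # raised from 300m to stop the CPU throttling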
That is weird, especially the zxid error in combination with CPU throttling. But glad you managed to solve it.