[Bug] Broker rolling upgrade loop when default value is set

Describe the bug

When initialDelaySeconds is set to 0 (instead of being omitted), the brokers get stuck in a rolling-upgrade spin loop.

# Source: kafka/templates/kafkacluster.yml
apiVersion: kafka.strimzi.io/v1beta1
kind: Kafka
metadata:
  name: kafka-cluster
spec:
  kafka:
    replicas: 5
    readinessProbe: 
      initialDelaySeconds: 0

To Reproduce

Steps to reproduce the behavior:

  1. Set the broker livenessProbe.initialDelaySeconds to 0.
  2. The brokers keep being rolled indefinitely, and the StatefulSet generation keeps being updated.

Operator logs for spin loop

2020-12-03 05:38:51 DEBUG KafkaAssemblyOperator:3264 - Reconciliation #1(watch) Kafka(kafka/kafka-cluster): Rolling pod kafka-cluster-kafka-0 due to [Pod has old generation]
2020-12-03 05:38:51 DEBUG KafkaRoller:687 - Reconciliation #1(watch) Kafka(kafka/kafka-cluster): Creating AdminClient for kafka-cluster-kafka-0.kafka-cluster-kafka-brokers.kafka.svc.cluster.local:9091,kafka-cluster-kafka-1.kafka-cluster-kafka-brokers.kafka.svc.cluster.local:9091,kafka-cluster-kafka-2.kafka-cluster-kafka-brokers.kafka.svc.cluster.local:9091,kafka-cluster-kafka-3.kafka-cluster-kafka-brokers.kafka.svc.cluster.local:9091,kafka-cluster-kafka-4.kafka-cluster-kafka-brokers.kafka.svc.cluster.local:9091
2020-12-03 05:38:53 INFO  KafkaRoller:500 - Reconciliation #1(watch) Kafka(kafka/kafka-cluster): Pod 0 needs to be restarted. Reason: [Pod has old generation]

Expected behavior

  1. No spin loop.

Environment (please complete the following information):

  • Strimzi version: 0.20.0
  • Installation method: helm charts
  • Kubernetes cluster: eks 1.16
  • Infrastructure: aws eks

It is believed that when a field is explicitly set to its default value, the Kubernetes API returns null instead of the actual default. Strimzi then detects this as a difference and triggers a rolling upgrade.

This could probably be generalized to any field that is set to its default value and for which Kubernetes returns null or omits the field from the API response.
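
The hypothesis above can be illustrated with a minimal Python sketch. It is not Strimzi's actual StatefulSetDiff code: `flatten` and `added_paths` are hypothetical helpers, `timeoutSeconds: 5` is only an illustrative value, and the sketch assumes that the API drops a zero-valued initialDelaySeconds from the object it returns.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts into {"/a/b": value} pairs (JSON-Pointer-like paths)."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}/{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def added_paths(current, desired):
    """Return the paths present in `desired` but missing from `current`."""
    cur, des = flatten(current), flatten(desired)
    return sorted(path for path in des if path not in cur)


# What the operator wants to apply (illustrative values).
desired = {"readinessProbe": {"initialDelaySeconds": 0, "timeoutSeconds": 5}}

# Assumption: what the API returns afterwards, with the zero value omitted.
current = {"readinessProbe": {"timeoutSeconds": 5}}

# The same "add" difference is found on every reconciliation, so under this
# assumption the comparison never converges.
for attempt in range(3):
    print(attempt, added_paths(current, desired))
```

If the assumption holds, patching the object does not help, because the stored object never contains the field that the desired object asks for.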

Operator Log

2020-12-03 05:38:51 DEBUG KafkaSetOperator:102 - StatefulSet kafka/kafka-cluster-kafka already exists, patching it
2020-12-03 05:38:51 DEBUG StatefulSetDiff:86 - StatefulSet kafka/kafka-cluster-kafka ignoring diff {"op":"remove","path":"/spec/revisionHistoryLimit"}
2020-12-03 05:38:51 DEBUG StatefulSetDiff:86 - StatefulSet kafka/kafka-cluster-kafka ignoring diff {"op":"remove","path":"/spec/template/metadata/annotations/strimzi.io~1generation"}
2020-12-03 05:38:51 DEBUG StatefulSetDiff:102 - StatefulSet kafka/kafka-cluster-kafka differs: {"op":"add","path":"/spec/template/spec/containers/0/livenessProbe/initialDelaySeconds","value":0}
2020-12-03 05:38:51 DEBUG StatefulSetDiff:103 - Current StatefulSet path /spec/template/spec/containers/0/livenessProbe/initialDelaySeconds has value 
2020-12-03 05:38:51 DEBUG StatefulSetDiff:104 - Desired StatefulSet path /spec/template/spec/containers/0/livenessProbe/initialDelaySeconds has value 0
2020-12-03 05:38:51 DEBUG StatefulSetDiff:102 - StatefulSet kafka/kafka-cluster-kafka differs: {"op":"add","path":"/spec/template/spec/containers/0/readinessProbe/initialDelaySeconds","value":0}
2020-12-03 05:38:51 DEBUG StatefulSetDiff:103 - Current StatefulSet path /spec/template/spec/containers/0/readinessProbe/initialDelaySeconds has value 
2020-12-03 05:38:51 DEBUG StatefulSetDiff:104 - Desired StatefulSet path /spec/template/spec/containers/0/readinessProbe/initialDelaySeconds has value 0
2020-12-03 05:38:51 DEBUG StatefulSetDiff:86 - StatefulSet kafka/kafka-cluster-kafka ignoring diff {"op":"remove","path":"/spec/template/spec/containers/0/terminationMessagePath"}
2020-12-03 05:38:51 DEBUG StatefulSetDiff:86 - StatefulSet kafka/kafka-cluster-kafka ignoring diff {"op":"remove","path":"/spec/template/spec/containers/0/terminationMessagePolicy"}
2020-12-03 05:38:51 DEBUG StatefulSetDiff:86 - StatefulSet kafka/kafka-cluster-kafka ignoring diff {"op":"remove","path":"/spec/template/spec/dnsPolicy"}
2020-12-03 05:38:51 DEBUG StatefulSetDiff:86 - StatefulSet kafka/kafka-cluster-kafka ignoring diff {"op":"remove","path":"/spec/template/spec/initContainers/0/env/0/valueFrom/fieldRef/apiVersion"}
2020-12-03 05:38:51 DEBUG StatefulSetDiff:86 - StatefulSet kafka/kafka-cluster-kafka ignoring diff {"op":"remove","path":"/spec/template/spec/initContainers/0/terminationMessagePath"}
2020-12-03 05:38:51 DEBUG StatefulSetDiff:86 - StatefulSet kafka/kafka-cluster-kafka ignoring diff {"op":"remove","path":"/spec/template/spec/initContainers/0/terminationMessagePolicy"}
2020-12-03 05:38:51 DEBUG StatefulSetDiff:86 - StatefulSet kafka/kafka-cluster-kafka ignoring diff {"op":"remove","path":"/spec/template/spec/restartPolicy"}
2020-12-03 05:38:51 DEBUG StatefulSetDiff:86 - StatefulSet kafka/kafka-cluster-kafka ignoring diff {"op":"remove","path":"/spec/template/spec/serviceAccount"}
2020-12-03 05:38:51 DEBUG StatefulSetDiff:86 - StatefulSet kafka/kafka-cluster-kafka ignoring diff {"op":"remove","path":"/spec/template/spec/volumes/4/configMap/defaultMode"}
2020-12-03 05:38:51 DEBUG StatefulSetDiff:86 - StatefulSet kafka/kafka-cluster-kafka ignoring diff {"op":"remove","path":"/spec/volumeClaimTemplates/0/spec/volumeMode"}
2020-12-03 05:38:51 DEBUG StatefulSetDiff:86 - StatefulSet kafka/kafka-cluster-kafka ignoring diff {"op":"remove","path":"/spec/volumeClaimTemplates/0/status"}
2020-12-03 05:38:51 DEBUG StatefulSetDiff:86 - StatefulSet kafka/kafka-cluster-kafka ignoring diff {"op":"remove","path":"/status"}
2020-12-03 05:38:51 DEBUG KafkaSetOperator:54 - Changed template spec => needs rolling update
2020-12-03 05:38:51 DEBUG StatefulSetOperator:305 - Patching StatefulSet kafka/kafka-cluster-kafka
2020-12-03 05:38:51 DEBUG KafkaSetOperator:168 - StatefulSet kafka-cluster-kafka in namespace kafka has been patched
2020-12-03 05:38:51 DEBUG KafkaAssemblyOperator:927 - Kafka.spec.kafka.version unchanged
2020-12-03 05:38:51 DEBUG KafkaRoller:203 - Reconciliation #1(watch) Kafka(kafka/kafka-cluster): Initial order for rolling restart [0, 1, 2, 3, 4]
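
Most of the StatefulSetDiff lines above are "ignoring diff" entries, so the few "differs" entries that actually trigger the patch are easy to miss. A small Python sketch for pulling them out of a DEBUG log might look like this. The function name `real_differences` is hypothetical, and the regular expression assumes the log line layout shown in this issue.

```python
import json
import re

# Matches lines such as:
# ... DEBUG StatefulSetDiff:102 - StatefulSet <ns>/<name> differs: {"op":"add",...}
DIFFERS = re.compile(r"StatefulSetDiff:\d+ - StatefulSet (\S+) differs: (\{.*\})\s*$")


def real_differences(log_lines):
    """Return (statefulset, op, path, value) for every non-ignored diff line."""
    found = []
    for line in log_lines:
        match = DIFFERS.search(line)
        if match:
            patch = json.loads(match.group(2))
            found.append(
                (match.group(1), patch["op"], patch["path"], patch.get("value"))
            )
    return found


# Two lines copied from the operator log above.
sample = [
    '2020-12-03 05:38:51 DEBUG StatefulSetDiff:86 - StatefulSet kafka/kafka-cluster-kafka ignoring diff {"op":"remove","path":"/spec/template/spec/dnsPolicy"}',
    '2020-12-03 05:38:51 DEBUG StatefulSetDiff:102 - StatefulSet kafka/kafka-cluster-kafka differs: {"op":"add","path":"/spec/template/spec/containers/0/livenessProbe/initialDelaySeconds","value":0}',
]

for name, op, path, value in real_differences(sample):
    print(name, op, path, value)
```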

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

2 reactions
scholzj commented, Dec 3, 2020

I opened a PR for this.

PaulLiang1 wrote: "I feel like this bug could be more general than just initialDelaySeconds: any field with a default value could be affected. As for fixes, it would be great if this could be fixed programmatically, but even just documenting it would be a good start."

Kubernetes is not consistent in how it handles some of these situations. Different fields use different default values, etc. So this is something that really needs to be handled field by field.
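
One way to read the "field by field" remark is as a per-field table of values that the API is assumed to drop, consulted before a difference is treated as real. The Python sketch below only illustrates that idea and is not the change made in the PR mentioned above. `OMITTED_WHEN` and `is_real_difference` are hypothetical names, and the single table entry is taken from the log in this issue.

```python
import re

# Hypothetical per-field table: path pattern -> value the API is assumed to
# leave out of its response. Only initialDelaySeconds comes from this issue.
OMITTED_WHEN = [
    (re.compile(r"/(liveness|readiness)Probe/initialDelaySeconds$"), 0),
]


def is_real_difference(patch):
    """Return False for an "add" whose value the API would omit anyway."""
    if patch.get("op") != "add":
        return True
    for pattern, omitted_value in OMITTED_WHEN:
        if pattern.search(patch["path"]) and patch.get("value") == omitted_value:
            return False
    return True


patches = [
    {"op": "add", "path": "/spec/template/spec/containers/0/readinessProbe/initialDelaySeconds", "value": 0},
    {"op": "add", "path": "/spec/template/spec/containers/0/readinessProbe/initialDelaySeconds", "value": 15},
]
print([is_real_difference(p) for p in patches])
```

Each further field would need its own entry once its behavior has been checked, which is the cost of the field-by-field approach.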

0 reactions
PaulLiang1 commented, Dec 3, 2020

Sorry for mixing up the default values. I feel like this bug could be more general than just initialDelaySeconds: any field with a default value could be affected. As for fixes, it would be great if this could be fixed programmatically, but even just documenting it would be a good start.
