KafkaRoller may continue to try to roll a kafka's pods long after the kafka is deleted
Describe the bug
Strimzi (latest)
If a kafka is deleted (kafka CR removed), KafkaRoller can continue trying to act on it for many minutes, needlessly and continually failing at each retry. This wastes resources.
In our use case, Strimzi will manage a large number of kafkas, with the set of kafkas changing relatively quickly. There is a real possibility that useful work is delayed.
In the example I highlight below, Strimzi is still needlessly processing the CR (Reconciliation #20) 17 minutes after it was deleted.
2021-04-30 11:04:19 INFO OperatorWatcher:40 - Reconciliation #20(watch) Kafka(foo-fzz4yx57aem1j0b/foo-fzz4yx57aem1j0b): Kafka foo-fzz4yx57aem1j0b in namespace foo-fzz4yx57aem1j0b was ADDED
..
2021-04-30 11:06:40 INFO OperatorWatcher:40 - Reconciliation #114(watch) Kafka(foo-fzz4yx57aem1j0b/foo-fzz4yx57aem1j0b): Kafka foo-fzz4yx57aem1j0b in namespace foo-fzz4yx57aem1j0b was DELETED
2021-04-30 11:07:48 INFO KafkaRoller:296 - Reconciliation #20(watch) Kafka(foo-fzz4yx57aem1j0b/foo-fzz4yx57aem1j0b): Could not roll pod 1 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 250ms
...
2021-04-30 11:07:48 DEBUG KafkaRoller:272 - Reconciliation #20(watch) Kafka(foo-fzz4yx57aem1j0b/foo-fzz4yx57aem1j0b): Considering restart of pod 2 after delay of 0 MILLISECONDS
2021-04-30 11:08:18 INFO KafkaRoller:296 - Reconciliation #20(watch) Kafka(foo-fzz4yx57aem1j0b/foo-fzz4yx57aem1j0b): Could not roll pod 2 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 250ms
2021-04-30 11:08:18 DEBUG KafkaRoller:272 - Reconciliation #20(watch) Kafka(foo-fzz4yx57aem1j0b/foo-fzz4yx57aem1j0b): Considering restart of pod 0 after delay of 250 MILLISECONDS
2021-04-30 11:08:48 INFO KafkaRoller:296 - Reconciliation #20(watch) Kafka(foo-fzz4yx57aem1j0b/foo-fzz4yx57aem1j0b): Could not roll pod 0 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 500ms
2021-04-30 11:08:48 DEBUG KafkaRoller:272 - Reconciliation #20(watch) Kafka(foo-fzz4yx57aem1j0b/foo-fzz4yx57aem1j0b): Considering restart of pod 1 after delay of 250 MILLISECONDS
2021-04-30 11:09:18 INFO KafkaRoller:296 - Reconciliation #20(watch) Kafka(foo-fzz4yx57aem1j0b/foo-fzz4yx57aem1j0b): Could not roll pod 1 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 500ms
....
2021-04-30 11:23:53 INFO KafkaRoller:289 - Reconciliation #20(watch) Kafka(foo-fzz4yx57aem1j0b/foo-fzz4yx57aem1j0b): Could not roll pod 2, giving up after 10 attempts. Total delay between attempts 127750ms
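The "Total delay between attempts 127750ms" figure in the last log line is consistent with a per-pod back-off that starts at 250ms and doubles after each failed attempt. That doubling is an inference from the log, not something confirmed here against the KafkaRoller source. A minimal sketch of the arithmetic (the class and method names are hypothetical):

```java
class BackoffTotal {
    // Sum of the delays between attempts, assuming the delay starts at
    // baseMs and doubles after every failed attempt. The doubling is an
    // assumption inferred from the log, not taken from Strimzi's code.
    static long totalDelayMs(long baseMs, int attempts) {
        long total = 0;
        long delay = baseMs;
        // n attempts are separated by n - 1 delays
        for (int i = 0; i < attempts - 1; i++) {
            total += delay;
            delay *= 2;
        }
        return total;
    }

    public static void main(String[] args) {
        // 250 + 500 + ... + 64000
        System.out.println(totalDelayMs(250, 10)); // prints 127750
    }
}
```

If that reading is right, the back-off itself accounts for only about two minutes per pod. The timestamps above show roughly 30 seconds between "Considering restart" and "Could not roll" for each attempt, so most of the 17 minutes appears to be spent inside the failing attempts rather than waiting between them.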
To Reproduce https://github.com/k-wall/strzimi-del-problem/blob/main/create_kafkas.sh
Steps to reproduce the behavior:
- Install Strimzi using the quick start, configured for STRIMZI_NAMESPACE set to * following the docs
- Run create_kafkas.sh 50 to create 50 kafkas
- Wait until approximately 50% have become ready
- Run oc delete k -l kafka=true --all-namespaces
- Watch logs
Expected behavior Efficient handling of the kafka delete case, short-circuiting long-running, expensive tasks.
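To illustrate what short-circuiting could look like, here is a minimal sketch of a retry loop that checks whether the owning resource still exists before every attempt and stops immediately if it does not. This is not Strimzi's implementation: `CancellableRetry`, `Outcome`, and the `resourceExists` check are hypothetical names, and in a real operator the check would have to be backed by something like the watch event or a lookup of the CR.

```java
import java.util.function.BooleanSupplier;
import java.util.function.LongConsumer;

class CancellableRetry {
    enum Outcome { SUCCEEDED, GAVE_UP, CANCELLED }

    // attempt:        returns true when the work (e.g. rolling a pod) succeeded
    // resourceExists: hypothetical check that the owning CR has not been deleted
    // pause:          how to wait between attempts (injected so it can be tested)
    static Outcome run(BooleanSupplier attempt, BooleanSupplier resourceExists,
                       int maxAttempts, long baseDelayMs, LongConsumer pause) {
        long delay = baseDelayMs;
        for (int i = 1; i <= maxAttempts; i++) {
            // Short-circuit: do not start another expensive attempt for a deleted resource
            if (!resourceExists.getAsBoolean()) {
                return Outcome.CANCELLED;
            }
            if (attempt.getAsBoolean()) {
                return Outcome.SUCCEEDED;
            }
            if (i < maxAttempts) {
                pause.accept(delay);
                delay *= 2; // doubling back-off, as assumed from the log
            }
        }
        return Outcome.GAVE_UP;
    }

    public static void main(String[] args) {
        int[] attempts = {0};
        // The attempt always fails; the resource "disappears" after two attempts.
        Outcome outcome = run(
                () -> { attempts[0]++; return false; },
                () -> attempts[0] < 2,
                10, 250, ms -> { /* no real sleeping in this demo */ });
        System.out.println(outcome + " after " + attempts[0] + " attempts");
    }
}
```

With a check like this, the work stops at the next attempt boundary instead of running through all 10 attempts. An attempt that is already in flight would still run to its own timeout, so cancelling that as well would need additional handling.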
Environment (please complete the following information):
- Strimzi version: 0.22.1
- Installation method: Yaml
- Kubernetes cluster: 4.7.2
- Infrastructure: AWS multi region.
YAML files and logs
Attached
Top GitHub Comments
I have the same problem. If there’s a mistake in the configuration of the kafka cluster, KafkaRoller enters a loop and does not recover; it keeps trying to roll the brokers and does not stop, not even when the kafka cluster is deleted. It actually takes a long time to stop trying to reconcile the cluster.
Thanks @scholzj for replying. I created an issue for it: https://github.com/strimzi/strimzi-kafka-operator/issues/7484