KafkaRoller may continue trying to roll a Kafka cluster's pods long after the Kafka resource is deleted


Describe the bug

Strimzi (latest)

If a Kafka cluster is deleted (its Kafka CR removed), KafkaRoller can keep acting on it for many minutes, needlessly failing at every retry. This wastes resources.

In our use case, Strimzi will be managing a large number of Kafka clusters, and that set will change relatively quickly, so there is a real possibility that useful work gets delayed.

In the example below, Strimzi is still needlessly processing the CR (reconciliation #20) 17 minutes after it was deleted.

2021-04-30 11:04:19 INFO  OperatorWatcher:40 - Reconciliation #20(watch) Kafka(foo-fzz4yx57aem1j0b/foo-fzz4yx57aem1j0b): Kafka foo-fzz4yx57aem1j0b in namespace foo-fzz4yx57aem1j0b was ADDED
..
2021-04-30 11:06:40 INFO  OperatorWatcher:40 - Reconciliation #114(watch) Kafka(foo-fzz4yx57aem1j0b/foo-fzz4yx57aem1j0b): Kafka foo-fzz4yx57aem1j0b in namespace foo-fzz4yx57aem1j0b was DELETED


2021-04-30 11:07:48 INFO  KafkaRoller:296 - Reconciliation #20(watch) Kafka(foo-fzz4yx57aem1j0b/foo-fzz4yx57aem1j0b): Could not roll pod 1 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 250ms
...
2021-04-30 11:07:48 DEBUG KafkaRoller:272 - Reconciliation #20(watch) Kafka(foo-fzz4yx57aem1j0b/foo-fzz4yx57aem1j0b): Considering restart of pod 2 after delay of 0 MILLISECONDS
2021-04-30 11:08:18 INFO  KafkaRoller:296 - Reconciliation #20(watch) Kafka(foo-fzz4yx57aem1j0b/foo-fzz4yx57aem1j0b): Could not roll pod 2 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 250ms
2021-04-30 11:08:18 DEBUG KafkaRoller:272 - Reconciliation #20(watch) Kafka(foo-fzz4yx57aem1j0b/foo-fzz4yx57aem1j0b): Considering restart of pod 0 after delay of 250 MILLISECONDS
2021-04-30 11:08:48 INFO  KafkaRoller:296 - Reconciliation #20(watch) Kafka(foo-fzz4yx57aem1j0b/foo-fzz4yx57aem1j0b): Could not roll pod 0 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 500ms
2021-04-30 11:08:48 DEBUG KafkaRoller:272 - Reconciliation #20(watch) Kafka(foo-fzz4yx57aem1j0b/foo-fzz4yx57aem1j0b): Considering restart of pod 1 after delay of 250 MILLISECONDS
2021-04-30 11:09:18 INFO  KafkaRoller:296 - Reconciliation #20(watch) Kafka(foo-fzz4yx57aem1j0b/foo-fzz4yx57aem1j0b): Could not roll pod 1 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 500ms
....
2021-04-30 11:23:53 INFO  KafkaRoller:289 - Reconciliation #20(watch) Kafka(foo-fzz4yx57aem1j0b/foo-fzz4yx57aem1j0b): Could not roll pod 2, giving up after 10 attempts. Total delay between attempts 127750ms
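The retry figures in this log are consistent with a simple exponential backoff: the first retry waits 250 ms, each later wait doubles, and 10 attempts leave 9 waits, so the total is 250 ms * (2^9 - 1) = 127750 ms, which matches the final log line. The Java sketch below reproduces that arithmetic. The doubling rule is an assumption inferred from the logged values, not taken from the KafkaRoller source, and BackoffMath is a hypothetical name. It also shows that the backoff explains only about 2 minutes of the roughly 17-minute tail; the timestamps show about 30 seconds elapsing inside each failing attempt (for example 11:07:48 to 11:08:18 for pod 2).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper reproducing the delay arithmetic seen in the log.
// Assumption: delay starts at initialDelayMs and doubles after each failure.
final class BackoffMath {
    private BackoffMath() {}

    /** Waits (ms) between consecutive attempts; there are attempts - 1 of them. */
    static List<Long> delays(long initialDelayMs, int attempts) {
        List<Long> result = new ArrayList<>();
        long delay = initialDelayMs;
        for (int i = 0; i < attempts - 1; i++) {
            result.add(delay);
            delay *= 2; // exponential backoff: double after each failed attempt
        }
        return result;
    }

    /** Sum of all waits, i.e. initialDelayMs * (2^(attempts - 1) - 1). */
    static long totalDelayMs(long initialDelayMs, int attempts) {
        long total = 0;
        for (long d : delays(initialDelayMs, attempts)) {
            total += d;
        }
        return total;
    }

    public static void main(String[] args) {
        // 10 attempts -> 9 waits: 250 + 500 + ... + 64000
        System.out.println(delays(250, 10));
        System.out.println(totalDelayMs(250, 10)); // prints 127750
    }
}
```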

To Reproduce

Reproducer script: https://github.com/k-wall/strzimi-del-problem/blob/main/create_kafkas.sh

Steps to reproduce the behavior:

  1. Install Strimzi using the quick start and, following the docs, configure it to watch all namespaces (STRIMZI_NAMESPACE set to *).
  2. Run create_kafkas.sh 50 to create 50 Kafka clusters.
  3. Wait until roughly half of them have become ready.
  4. Delete them all: oc delete k -l kafka=true --all-namespaces
  5. Watch the Cluster Operator logs.

Expected behavior

Efficient handling of the Kafka delete case: long-running, expensive tasks such as rolling the deleted cluster's pods should be short-circuited.
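One way to get this kind of short-circuiting is to re-check, before every retry, whether the owning Kafka custom resource still exists, and to abandon the roll as soon as it is gone. The sketch below is hypothetical: ShortCircuitRetry, retry and ResourceDeletedException are illustrative names, not Strimzi's API, and the code is synchronous for brevity, whereas the operator's own code is asynchronous. The existence check is passed in as a BooleanSupplier; in a real operator it would more likely consult a watch-driven cache than call the API server on every attempt.

```java
import java.util.concurrent.Callable;
import java.util.function.BooleanSupplier;

// Hypothetical sketch (not Strimzi's API): retry with exponential backoff that
// re-checks whether the owning custom resource still exists before each attempt.
final class ShortCircuitRetry {
    private ShortCircuitRetry() {}

    /** Signals that retrying stopped because the owning resource was deleted. */
    static final class ResourceDeletedException extends Exception {
        ResourceDeletedException(String message) {
            super(message);
        }
    }

    static <T> T retry(Callable<T> action,
                       BooleanSupplier resourceExists,
                       int maxAttempts,
                       long initialDelayMs) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delayMs = initialDelayMs;
        Exception lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            // Short circuit: stop as soon as the resource is gone instead of
            // spending the remaining attempts on a cluster that no longer exists.
            if (!resourceExists.getAsBoolean()) {
                throw new ResourceDeletedException(
                        "resource deleted; abandoned after " + (attempt - 1) + " attempt(s)");
            }
            try {
                return action.call();
            } catch (Exception e) {
                lastFailure = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMs); // back off before the next attempt
                    delayMs *= 2;          // double the delay each time
                }
            }
        }
        throw lastFailure; // every attempt failed while the resource still existed
    }
}
```

With this check, a roll of a deleted cluster stops at the next attempt boundary instead of running through all remaining attempts. It does not interrupt an attempt that is already in progress, so an attempt blocked on a long timeout still runs to completion.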

Environment:

  • Strimzi version: 0.22.1
  • Installation method: Yaml
  • Kubernetes cluster: 4.7.2
  • Infrastructure: AWS multi region.

YAML files and logs

Attached: issue_4869.log


Issue Analytics

  • State: open
  • Created 2 years ago
  • Reactions:1
  • Comments:7 (4 by maintainers)

Top GitHub Comments

1 reaction
redsk commented, Feb 7, 2022

I have the same problem. If there's a mistake in the configuration of the Kafka cluster, KafkaRoller enters a loop and does not recover: it keeps trying to roll the brokers and does not stop, not even when the Kafka cluster is deleted. It takes a long time before it stops trying to reconcile the cluster.

0 reactions
eslam-gomaa commented, Oct 14, 2022

Thanks @scholzj for replying. I created an issue for it: https://github.com/strimzi/strimzi-kafka-operator/issues/7484
