Takes too long for KafkaUser to be ready

Describe the bug When I create a small number of KafkaUsers (say 1-5), it works fine and the KafkaUsers reach Ready status almost immediately. But when I create KafkaUsers in bulk, say 50 at a time, some of them can take almost 10 minutes to become “Ready”. Until then, clients cannot publish or consume messages with these KafkaUsers, and the wait grows as the number of KafkaUsers on the Kafka cluster increases.

To Reproduce Steps to reproduce the behavior:

  1. Prerequisites: one Kafka cluster with one topic already created.

  2. Create 50 KafkaUsers on the Strimzi Kafka cluster at the same time via a script (I create them through the Kubernetes API).

  3. Within a minute, try to publish to or read from the topic with all of these users. You will get Topic Authorization Failed and Group Authorization Failed errors, because the users are not yet “Ready”.

  4. Create the ‘KafkaUser’ custom resource using the YAML below:

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaUser
metadata:
  name: cog-reader
  labels:
    strimzi.io/cluster: pod-kafka-cluster
spec:
  authentication:
    type: scram-sha-512
  authorization:
    type: simple
    acls:
      - resource:
          type: topic
          name: pod-notifications
          patternType: literal
        operation: Read
      - resource:
          type: topic
          name: pod-notifications
          patternType: literal
        operation: Describe
      - resource:
          type: group
          name: cog-group
          patternType: literal
        operation: Read
  5. Authorization is simple; authentication is SASL_SSL/SCRAM-SHA-512.
  6. Apply 50 such users in parallel to the Strimzi cluster.
  7. Check the user-operator logs; they are cluttered with TimeoutExceptions:
2021-10-09 06:55:40 DEBUG KafkaAdminClient:815 - [AdminClient clientId=adminclient-1] Call(callName=describeAcls, deadlineMs=1633762539991, tries=1, nextAllowedTryMs=1633762540099) timed out at 1633762539999 after 1 attempt(s)
java.lang.Exception: TimeoutException: Timed out waiting to send the call. Call: describeAcls
	at org.apache.kafka.clients.admin.KafkaAdminClient$Call.failWithTimeout(KafkaAdminClient.java:816) [org.apache.kafka.kafka-clients-2.8.0.jar:?]
	at org.apache.kafka.clients.admin.KafkaAdminClient$Call.fail(KafkaAdminClient.java:789) [org.apache.kafka.kafka-clients-2.8.0.jar:?]
	at org.apache.kafka.clients.admin.KafkaAdminClient$TimeoutProcessor.handleTimeouts(KafkaAdminClient.java:912) [org.apache.kafka.kafka-clients-2.8.0.jar:?]
	at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.timeoutCallsToSend(KafkaAdminClient.java:993) [org.apache.kafka.kafka-clients-2.8.0.jar:?]
	at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.processRequests(KafkaAdminClient.java:1301) [org.apache.kafka.kafka-clients-2.8.0.jar:?]
	at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.run(KafkaAdminClient.java:1264) [org.apache.kafka.kafka-clients-2.8.0.jar:?]
	at java.lang.Thread.run(Thread.java:829) [?:?]
2021-10-09 06:55:40 DEBUG KafkaAdminClient:815 - [AdminClient clientId=adminclient-1] Call(callName=alterUserScramCredentials, deadlineMs=1633762539992, tries=1, nextAllowedTryMs=1633762540099) timed out at 1633762539999 after 1 attempt(s)
java.lang.Exception: TimeoutException: Timed out waiting to send the call. Call: alterUserScramCredentials
	at org.apache.kafka.clients.admin.KafkaAdminClient$Call.failWithTimeout(KafkaAdminClient.java:816) [org.apache.kafka.kafka-clients-2.8.0.jar:?]
	at org.apache.kafka.clients.admin.KafkaAdminClient$Call.fail(KafkaAdminClient.java:789) [org.apache.kafka.kafka-clients-2.8.0.jar:?]
	at org.apache.kafka.clients.admin.KafkaAdminClient$TimeoutProcessor.handleTimeouts(KafkaAdminClient.java:912) [org.apache.kafka.kafka-clients-2.8.0.jar:?]
	at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.timeoutCallsToSend(KafkaAdminClient.java:993) [org.apache.kafka.kafka-clients-2.8.0.jar:?]
	at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.processRequests(KafkaAdminClient.java:1301) [org.apache.kafka.kafka-clients-2.8.0.jar:?]
	at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.run(KafkaAdminClient.java:1264) [org.apache.kafka.kafka-clients-2.8.0.jar:?]
	at java.lang.Thread.run(Thread.java:829) [?:?]
org.apache.kafka.common.errors.TimeoutException: Call(callName=describeClientQuotas, deadlineMs=1633758992264, tries=1, nextAllowedTryMs=1633758992365) timed out at 1633758992265 after 1 attempt(s)
Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting to send the call. Call: describeClientQuotas
2021-10-09 05:56:32 DEBUG StatusDiff:46 - Status differs: {"op":"add","path":"/conditions/0/reason","value":"TimeoutException"}
2021-10-09 05:56:32 DEBUG StatusDiff:48 - Desired Status path /conditions/0/reason has value "TimeoutException"
org.apache.kafka.common.errors.TimeoutException: Call(callName=describeClientQuotas, deadlineMs=1633758992264, tries=1, nextAllowedTryMs=1633758992365) timed out at 1633758992265 after 1 attempt(s)
Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting to send the call. Call: describeClientQuotas
java.lang.Exception: TimeoutException: Timed out waiting to send the call. Call: describeClientQuotas
java.lang.Exception: TimeoutException: Timed out waiting to send the call. Call: describeClientQuotas

java.lang.Exception: TimeoutException: Timed out waiting to send the call. Call: describeClientQuotas
        at org.apache.kafka.clients.admin.KafkaAdminClient$Call.failWithTimeout(KafkaAdminClient.java:816) [org.apache.kafka.kafka-clients-2.8.0.jar:?]
        at org.apache.kafka.clients.admin.KafkaAdminClient$Call.fail(KafkaAdminClient.java:789) [org.apache.kafka.kafka-clients-2.8.0.jar:?]
        at org.apache.kafka.clients.admin.KafkaAdminClient$TimeoutProcessor.handleTimeouts(KafkaAdminClient.java:912) [org.apache.kafka.kafka-clients-2.8.0.jar:?]
        at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.timeoutCallsToSend(KafkaAdminClient.java:993) [org.apache.kafka.kafka-clients-2.8.0.jar:?]
        at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.processRequests(KafkaAdminClient.java:1301) [org.apache.kafka.kafka-clients-2.8.0.jar:?]
        at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.run(KafkaAdminClient.java:1264) [org.apache.kafka.kafka-clients-2.8.0.jar:?]
        at java.lang.Thread.run(Thread.java:829) [?:?]
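
To see which AdminClient calls are timing out most often, log lines like the ones above can be tallied by `callName`. The following is a minimal Python sketch, assuming the user-operator log has been saved as text. The function name `tally_timeouts` and the regular expression are illustrative and not part of Strimzi or Kafka.

```python
import re
from collections import Counter

# Matches the "Call(callName=..., ...) timed out" fragments seen in the log above.
TIMED_OUT = re.compile(r"Call\(callName=(\w+),[^)]*\) timed out")


def tally_timeouts(log_text: str) -> Counter:
    """Count timed-out AdminClient calls per callName."""
    return Counter(TIMED_OUT.findall(log_text))


# Three lines copied from the log excerpt above.
sample = """\
2021-10-09 06:55:40 DEBUG KafkaAdminClient:815 - [AdminClient clientId=adminclient-1] Call(callName=describeAcls, deadlineMs=1633762539991, tries=1, nextAllowedTryMs=1633762540099) timed out at 1633762539999 after 1 attempt(s)
2021-10-09 06:55:40 DEBUG KafkaAdminClient:815 - [AdminClient clientId=adminclient-1] Call(callName=alterUserScramCredentials, deadlineMs=1633762539992, tries=1, nextAllowedTryMs=1633762540099) timed out at 1633762539999 after 1 attempt(s)
org.apache.kafka.common.errors.TimeoutException: Call(callName=describeClientQuotas, deadlineMs=1633758992264, tries=1, nextAllowedTryMs=1633758992365) timed out at 1633758992265 after 1 attempt(s)
"""

if __name__ == "__main__":
    for name, count in tally_timeouts(sample).most_common():
        print(f"{name}: {count}")
```

Run against the full log, this gives a quick view of whether ACL, SCRAM credential, or quota calls dominate the timeouts.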

Expected behavior Creating a Kafka user with ACLs should be a straightforward call, and it should behave predictably whether 5 users or 50 users are created.

Environment (please complete the following information):

  • Strimzi version: 0.25.0

  • Installation method:

    • kubectl get ns kafka || kubectl create ns kafka
    • helm repo add strimzi https://strimzi.io/charts/
    • helm upgrade --install strimzi-kafka strimzi/strimzi-kafka-operator --version="0.25.0" -n kafka
  • Kubernetes cluster: EKS 1.18

  • Infrastructure: Amazon EKS

YAML files and logs

Kafka cluster yaml

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: pod-kafka-cluster
  namespace: kafka
spec:
  kafka:
    version: 2.8.0
    replicas: 3
    listeners:
      - name: plain
        port: 9092
        tls: false
        type: internal
      - name: external
        authentication:
          type: scram-sha-512
        port: 9094
        tls: true
        type: loadbalancer
    authorization:
      type: simple
    config:
      offsets.topic.replication.factor: 1
      transaction.state.log.replication.factor: 1
      transaction.state.log.min.isr: 1
      log.message.format.version: "2.8"
    storage:
      type: persistent-claim
      size: 50Gi
      deleteClaim: true
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 50Gi
      deleteClaim: true
  entityOperator:
    userOperator: 
      reconciliationIntervalSeconds: 157680000
      logging:
        type: inline
        loggers:
          rootLogger.level: DEBUG      
    topicOperator: 
      reconciliationIntervalSeconds: 157680000
      logging:
        type: inline
        loggers:
          rootLogger.level: DEBUG      

Topic yaml

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: pod-notifications
  labels:
    strimzi.io/cluster: pod-kafka-cluster
spec:
  partitions: 1
  replicas: 3

Publisher user (create multiple - this is just an example)

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaUser
metadata:
  name: pod-agent-writer
  labels:
    strimzi.io/cluster: pod-kafka-cluster
spec:
  authentication:
    type: scram-sha-512
  authorization:
    type: simple
    acls:
      - resource:
          type: topic
          name: pod-notifications
          patternType: literal
        operation: Write
      - resource:
          type: topic
          name: pod-notifications
          patternType: literal
        operation: Describe

Consumer user (create multiple - this is just an example)

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaUser
metadata:
  name: cog-reader
  labels:
    strimzi.io/cluster: pod-kafka-cluster
spec:
  authentication:
    type: scram-sha-512
  authorization:
    type: simple
    acls:
      - resource:
          type: topic
          name: pod-notifications
          patternType: literal
        operation: Read
      - resource:
          type: topic
          name: pod-notifications
          patternType: literal
        operation: Describe
      - resource:
          type: group
          name: cog-group
          patternType: literal
        operation: Read
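
For the bulk case, the 50 manifests can be generated from the consumer example above. The following is a minimal Python sketch using only the standard library. The numbered names (`cog-reader-1`, `cog-reader-2`, ...) and the helper `render_users` are hypothetical, chosen for illustration; the original report does not say how the users were named.

```python
from string import Template

# Template mirrors the consumer KafkaUser above; only the name and group vary.
KAFKA_USER = Template("""\
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaUser
metadata:
  name: $name
  labels:
    strimzi.io/cluster: pod-kafka-cluster
spec:
  authentication:
    type: scram-sha-512
  authorization:
    type: simple
    acls:
      - resource:
          type: topic
          name: pod-notifications
          patternType: literal
        operation: Read
      - resource:
          type: topic
          name: pod-notifications
          patternType: literal
        operation: Describe
      - resource:
          type: group
          name: $group
          patternType: literal
        operation: Read
""")


def render_users(count: int, prefix: str = "cog-reader", group: str = "cog-group") -> str:
    """Return one multi-document YAML string containing `count` KafkaUser manifests."""
    docs = [
        KAFKA_USER.substitute(name=f"{prefix}-{i}", group=group)
        for i in range(1, count + 1)
    ]
    # "---" separates the documents in a multi-document YAML stream.
    return "---\n".join(docs)


if __name__ == "__main__":
    print(render_users(50))
```

The output could be piped to `kubectl apply -f -` to create all the users at once, which should approximate the load described in this report.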

Additional context Use case: I need to create Kafka users and topics on demand, so I have a microservice that calls the Kubernetes API, which in turn invokes the Strimzi APIs to create the Kafka resources. The topic operator works fine and topics reach Ready status. KafkaUsers, however, can take too long to become ready, which causes components that rely on this microservice to fail.
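
Until the operator itself gets faster, one way for such a microservice to avoid handing out credentials too early is to poll the KafkaUser status and wait for a Ready condition. The following is a minimal Python sketch, not something proposed in this issue. It assumes the resource is available as a dict whose `status.conditions` list contains an entry with `type: Ready` and `status: "True"` once the user is ready. The `fetch` callable is a hypothetical stand-in for whatever Kubernetes client call the service already uses.

```python
import time
from typing import Callable


def is_ready(resource: dict) -> bool:
    """True if status.conditions holds a condition of type Ready with status "True"."""
    conditions = (resource.get("status") or {}).get("conditions") or []
    return any(
        c.get("type") == "Ready" and c.get("status") == "True" for c in conditions
    )


def wait_until_ready(
    fetch: Callable[[], dict],          # hypothetical: returns the KafkaUser as a dict
    timeout_s: float = 600.0,           # about the 10 minutes reported above
    initial_delay_s: float = 1.0,
    max_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `fetch` with exponential backoff until the resource is Ready or time runs out."""
    deadline = time.monotonic() + timeout_s
    delay = initial_delay_s
    while True:
        if is_ready(fetch()):
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        sleep(min(delay, remaining))
        delay = min(delay * 2, max_delay_s)
```

The backoff keeps the polling itself from adding much load on the Kubernetes API while many users are pending. It only hides the delay from callers; it does not shorten it.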

Earlier, periodic reconciliation was causing too many issues, so I turned it off. Now that the number of users has increased, all those timeout errors are back. Could the Kafka Admin client already be busy and ignoring the calls?

Any help or pointers here would be appreciated.

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 41 (19 by maintainers)

Top GitHub Comments

1 reaction
scholzj commented, Feb 19, 2022

I don’t think we have any timeline. But it is on our TODO list.

1 reaction
sknot-rh commented, Oct 12, 2021

@GarimaBathla I did further testing with 500 KafkaUsers. I found that each KafkaUser requires 5 AdminClient calls per reconciliation. That means a high number of KUs is basically DDoSing the Kafka brokers with AC requests. I tried increasing the number of Kafka brokers (3 -> 7), which improved the situation a bit, but there are still KUs that are not ready after a long time. I think we should try to batch the requests to reduce the load on the brokers (@tombentley's idea). What is your use case for creating such a high number of KUs?
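
The arithmetic behind this comment can be made concrete. The following is a back-of-the-envelope Python sketch using the figure of 5 AdminClient calls per KafkaUser reported above. The batching model is a simplifying assumption for illustration: it supposes each of the 5 call types could cover up to `batch_size` users per request, which is not necessarily how Strimzi does or would implement batching.

```python
import math

CALLS_PER_USER = 5  # AdminClient calls per KafkaUser per reconciliation, per the comment above


def unbatched_calls(users: int) -> int:
    """Calls per reconciliation pass when every user issues its own requests."""
    return users * CALLS_PER_USER


def batched_calls(users: int, batch_size: int) -> int:
    """Calls per pass if each call type covered up to `batch_size` users (assumption)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return CALLS_PER_USER * math.ceil(users / batch_size)


if __name__ == "__main__":
    for users in (5, 50, 500):
        print(users, unbatched_calls(users), batched_calls(users, batch_size=50))
```

Under these assumptions, 500 users cost 2500 calls per pass unbatched, against 50 calls with a hypothetical batch size of 50.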


