Producer.Flush hangs indefinitely when partition has no leader

Description

Calling Flush() on a producer hangs indefinitely when the target partition has no leader, regardless of message.timeout.ms. During this time no delivery handlers are invoked to report local message timeouts; they are invoked only once the partition obtains a leader.

How to reproduce

  1. Create a 3-broker Kafka cluster following the instructions in the Kafka Quickstart.
  2. Create a topic with a replication factor of 2 named my-replicated-topic
    $ bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 2 --partitions 1 --topic my-replicated-topic
    
  3. Determine which brokers have a partition replica. In the example below, brokers 0 and 1 contain partition replicas. Broker 2 does not have a replica of this partition.
    $ bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic my-replicated-topic
    Topic:my-replicated-topic	PartitionCount:1	ReplicationFactor:2	Configs:segment.bytes=1073741824
        Topic: my-replicated-topic	Partition: 0	Leader: 1	Replicas: 1,0	Isr: 1,0
    
  4. Everything should work at this point. You should be able to run the dotnet application and get output like the following.
    $ cd my-producer
    $ dotnet run
    Hello World!
    Flushing... (shouldn't take more than 5000ms)
    [✓] Hello world 0
    [✓] Hello world 1
    [✓] Hello world 2
    [✓] Hello world 3
    [✓] Hello world 4
    Flush completed in less than 15000ms.
    
  5. Now, stop the two Kafka brokers that have replicas of your topic partition. In my example, I need to stop brokers 0 and 1. I will leave broker 2 running. The goal is to ensure that the partition does not have a leader. In the output of the command below, you can see this via Leader: none. You may need to change the --bootstrap-server value depending on which broker is still available.
    $ bin/kafka-topics.sh --bootstrap-server localhost:9094 --describe --topic my-replicated-topic
    Topic:my-replicated-topic	PartitionCount:1	ReplicationFactor:2	Configs:segment.bytes=1073741824
        Topic: my-replicated-topic	Partition: 0	Leader: none	Replicas: 1,0	Isr:
    
  6. Run the dotnet application again. You will notice that we do not successfully deliver messages (this is expected). The problems are that 1) we don’t receive any failed delivery reports and 2) the call to Flush() does not respect the MessageTimeoutMs.
    $ dotnet run
    Hello World!
    [ProducerError] localhost:9093/bootstrap: Connect to ipv6#[::1]:9093 failed: Connection refused (after 1ms in state CONNECT)
    Flushing... (shouldn't take more than 5000ms)
    Flush DID NOT complete after 15000ms!
    

Checklist

Please provide the following information:

  • A complete (i.e. we can run it), minimal program demonstrating the problem. No need to supply a project file. See this Gist
  • Confluent.Kafka nuget version: 1.1.0
  • Apache Kafka version: kafka_2.3.0
  • Client configuration.
            var producerConfig = new ProducerConfig
            {
                BootstrapServers = "localhost:9092,localhost:9093,localhost:9094",
                MessageTimeoutMs = 5000,
            };
    
  • Operating system: OSX & Windows
  • Provide logs (with “debug” : “…” as necessary in configuration).
  • Provide broker log excerpts.
  • Critical issue.
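
The linked Gist is not reproduced on this page. As a rough sketch only, a program along the following lines would be consistent with the client configuration and console output shown above. The message loop, the error handler, the `[x]` prefix for failed reports, and the use of `Flush(TimeSpan)` with a 15000 ms bound are assumptions made for illustration, not the reporter's actual code.

```csharp
using System;
using Confluent.Kafka;

class Program
{
    static void Main()
    {
        Console.WriteLine("Hello World!");

        // Configuration taken from the issue's checklist.
        var producerConfig = new ProducerConfig
        {
            BootstrapServers = "localhost:9092,localhost:9093,localhost:9094",
            MessageTimeoutMs = 5000,
        };

        using (var producer = new ProducerBuilder<Null, string>(producerConfig)
            // Assumed handler: prints client-level errors such as refused connections.
            .SetErrorHandler((_, e) => Console.WriteLine($"[ProducerError] {e.Reason}"))
            .Build())
        {
            for (var i = 0; i < 5; i++)
            {
                // The delivery handler should fire once per message,
                // either on success or on failure (e.g. a local timeout).
                producer.Produce(
                    "my-replicated-topic",
                    new Message<Null, string> { Value = $"Hello world {i}" },
                    report => Console.WriteLine(report.Error.IsError
                        ? $"[x] {report.Message.Value}: {report.Error.Reason}"
                        : $"[✓] {report.Message.Value}"));
            }

            Console.WriteLine("Flushing... (shouldn't take more than 5000ms)");

            // Bounded flush (an assumption of this sketch) so the program always
            // terminates. The return value is the number of items still queued
            // when the timeout elapses.
            var remaining = producer.Flush(TimeSpan.FromMilliseconds(15000));

            Console.WriteLine(remaining == 0
                ? "Flush completed in less than 15000ms."
                : "Flush DID NOT complete after 15000ms!");
        }
    }
}
```

Against a healthy cluster, each message should produce a `[✓]` line before the flush returns. In the leaderless scenario described in this issue, one would expect five failed delivery reports (for example with the `Local_MsgTimedOut` error code) after roughly `MessageTimeoutMs`; the reported bug is that no reports arrive at all.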

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

1 reaction
mhowlett commented, Aug 13, 2019

closing this as there’s now a PR open with a fix - thanks for bringing this to our attention. it’ll be resolved in the next release.

0 reactions
edenhill commented, Aug 13, 2019

Thank you!

Reproduced and fixed in this PR: https://github.com/confluentinc/confluent-kafka-dotnet/issues/1027
