Producer.Flush hangs indefinitely when partition has no leader
Description
Calling Flush() on a producer hangs indefinitely when the target partition has no leader, regardless of message.timeout.ms. Delivery handlers are not invoked during this time to indicate local message timeouts; they are not invoked until the partition obtains a leader.
How to reproduce
- Create a 3-broker Kafka cluster following the instructions in the Kafka Quickstart.
- Create a topic named my-replicated-topic with a replication factor of 2.
  $ bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 2 --partitions 1 --topic my-replicated-topic
- Determine which brokers have a partition replica. In the example below, brokers 0 and 1 contain partition replicas. Broker 2 does not have a replica of this partition.
  $ bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic my-replicated-topic
  Topic:my-replicated-topic PartitionCount:1 ReplicationFactor:2 Configs:segment.bytes=1073741824
      Topic: my-replicated-topic Partition: 0 Leader: 1 Replicas: 1,0 Isr: 1,0
- Everything should work at this point. You should be able to run the dotnet application and get output like the following.
  $ cd my-producer
  $ dotnet run
  Hello World!
  Flushing... (shouldn't take more than 5000ms)
  [✓] Hello world 0
  [✓] Hello world 1
  [✓] Hello world 2
  [✓] Hello world 3
  [✓] Hello world 4
  Flush completed in less than 15000ms.
- Now, stop the two Kafka brokers that have replicas of your topic partition. In my example, I need to stop brokers 0 and 1. I will leave broker 2 running. The goal is to ensure that the partition does not have a leader. In the command below, you can see this via Leader: none. You may need to change the bootstrap-server depending on which broker is still available.
  $ bin/kafka-topics.sh --bootstrap-server localhost:9094 --describe --topic my-replicated-topic
  Topic:my-replicated-topic PartitionCount:1 ReplicationFactor:2 Configs:segment.bytes=1073741824
      Topic: my-replicated-topic Partition: 0 Leader: none Replicas: 1,0 Isr:
- Run the dotnet application again. You will notice that we do not successfully deliver messages (this is expected). The problems are that 1) we don’t receive any failed delivery reports and 2) the call to Flush() does not respect the MessageTimeoutMs.
  $ dotnet run
  Hello World!
  [ProducerError] localhost:9093/bootstrap: Connect to ipv6#[::1]:9093 failed: Connection refused (after 1ms in state CONNECT)
  Flushing... (shouldn't take more than 5000ms)
  Flush DID NOT complete after 15000ms!
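The two manual checks in the steps above (which brokers hold a replica, and whether the partition currently has a leader) can be scripted. The following is only a sketch: the function names replica_brokers and leaderless_partitions are hypothetical helpers, not part of Kafka, and they assume the describe output keeps the `Partition: N Leader: X Replicas: a,b` layout shown in this report, which may differ in other Kafka versions.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not shipped with Kafka). They read the text printed by
# `kafka-topics.sh --describe` on stdin and assume the field layout shown in
# this issue: "... Partition: 0 Leader: 1 Replicas: 1,0 Isr: 1,0".

# Print each broker id listed after "Replicas:", one per line.
replica_brokers() {
  awk '{
    for (i = 1; i < NF; i++) {
      if ($i == "Replicas:") {
        n = split($(i + 1), ids, ",")
        for (j = 1; j <= n; j++) print ids[j]
      }
    }
  }'
}

# Print the partition number of every partition whose leader is "none".
leaderless_partitions() {
  awk '{
    part = ""; leader = ""
    for (i = 1; i < NF; i++) {
      if ($i == "Partition:") part = $(i + 1)
      if ($i == "Leader:") leader = $(i + 1)
    }
    if (leader == "none") print part
  }'
}

# Example use against a live cluster (adjust the bootstrap server as needed):
#   bin/kafka-topics.sh --bootstrap-server localhost:9094 --describe \
#     --topic my-replicated-topic | leaderless_partitions
```

If leaderless_partitions prints nothing, every partition in the describe output reported a leader, so the hang described here should not be reproducible yet.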
Checklist
Please provide the following information:
- A complete (i.e. we can run it), minimal program demonstrating the problem. No need to supply a project file. See this Gist
- Confluent.Kafka nuget version. 1.1.0
- Apache Kafka version. kafka_2.3.0
- Client configuration.
  var producerConfig = new ProducerConfig
  {
      BootstrapServers = "localhost:9092,localhost:9093,localhost:9094",
      MessageTimeoutMs = 5000,
  };
- Operating system. OSX & Windows
- Provide logs (with “debug” : “…” as necessary in configuration).
- Provide broker log excerpts.
- Critical issue.
Issue Analytics
- State:
- Created 4 years ago
- Comments:8 (4 by maintainers)
Top GitHub Comments
closing this as there’s now a PR open with a fix - thanks for bringing this to our attention. it’ll be resolved in the next release.
Thank you!
Reproduced and fixed in this PR: https://github.com/confluentinc/confluent-kafka-dotnet/issues/1027