Producer.Flush hangs indefinitely when partition has no leader

Description

Calling Flush() on a producer hangs indefinitely when the target partition has no leader, regardless of message.timeout.ms. While the partition is leaderless, no delivery handlers are invoked, not even to report local message timeouts. They are invoked only once the partition obtains a leader.

How to reproduce

Create a 3-broker Kafka cluster following the instructions in the Kafka Quickstart.

Create a topic named my-replicated-topic with a replication factor of 2.

```
$ bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 2 --partitions 1 --topic my-replicated-topic
```

Determine which brokers have a partition replica. In the example below, brokers 0 and 1 contain partition replicas. Broker 2 does not have a replica of this partition.

```
$ bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic my-replicated-topic
Topic:my-replicated-topic	PartitionCount:1	ReplicationFactor:2	Configs:segment.bytes=1073741824
    Topic: my-replicated-topic	Partition: 0	Leader: 1	Replicas: 1,0	Isr: 1,0
```
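If you want to script this step, a small helper can pull the replica broker IDs out of the describe output. This is a minimal sketch and not part of the original report: the function name `replica_brokers` is hypothetical, and it assumes the `Replicas: 1,0` field format shown in the output above.

```shell
# Print the distinct broker IDs that hold a replica, one per line.
# Reads `kafka-topics.sh --describe` output on stdin.
# Assumption: each partition line contains a "Replicas: <id>,<id>,..." field.
replica_brokers() {
  awk '
    match($0, /Replicas: [0-9,]+/) {
      # Drop the 10-character "Replicas: " prefix, keep the ID list.
      list = substr($0, RSTART + 10, RLENGTH - 10)
      n = split(list, ids, ",")
      for (j = 1; j <= n; j++) print ids[j]
    }
  ' | sort -n -u
}
```

For example, `bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic my-replicated-topic | replica_brokers` would list the brokers to stop in the next step.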

Everything should work at this point. You should be able to run the dotnet application and get output like the following.

```
$ cd my-producer
$ dotnet run
Hello World!
Flushing... (shouldn't take more than 5000ms)
[✓] Hello world 0
[✓] Hello world 1
[✓] Hello world 2
[✓] Hello world 3
[✓] Hello world 4
Flush completed in less than 15000ms.
```

Now, stop the two Kafka brokers that have replicas of your topic partition. In my example, I need to stop brokers 0 and 1 and leave broker 2 running. The goal is to ensure that the partition does not have a leader. In the output below, you can see this via Leader: none. You may need to change the --bootstrap-server value depending on which broker is still available.
```
$ bin/kafka-topics.sh --bootstrap-server localhost:9094 --describe --topic my-replicated-topic
Topic:my-replicated-topic	PartitionCount:1	ReplicationFactor:2	Configs:segment.bytes=1073741824
    Topic: my-replicated-topic	Partition: 0	Leader: none	Replicas: 1,0	Isr:
```
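To confirm the leaderless state from a script rather than by eye, the describe output can be filtered for `Leader: none`. The helper below is a hedged sketch: `leaderless_partitions` is a hypothetical name, and it assumes the line format shown in the output above (other Kafka versions may print it differently).

```shell
# Print "<topic> <partition>" for every partition that has no leader.
# Reads `kafka-topics.sh --describe` output on stdin.
# Assumption: leaderless partitions are reported as "Leader: none".
leaderless_partitions() {
  awk '
    /Leader: none/ && match($0, /Partition: [0-9]+/) {
      # Skip the 11-character "Partition: " prefix.
      part = substr($0, RSTART + 11, RLENGTH - 11)
      # Kafka topic names are limited to letters, digits, ".", "_" and "-".
      if (match($0, /Topic: [A-Za-z0-9._-]+/)) {
        topic = substr($0, RSTART + 7, RLENGTH - 7)
        print topic, part
      }
    }
  '
}
```

An empty result means every listed partition currently has a leader, so the reproduction is not yet in the state the next step needs.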
Run the dotnet application again. Messages are not delivered, which is expected. The problems are that 1) we don’t receive any failed delivery reports and 2) the call to Flush() does not respect MessageTimeoutMs.
```
$ dotnet run
Hello World!
[ProducerError] localhost:9093/bootstrap: Connect to ipv6#[::1]:9093 failed: Connection refused (after 1ms in state CONNECT)
Flushing... (shouldn't take more than 5000ms)
Flush DID NOT complete after 15000ms!
```
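Because the hang has no upper bound, a scripted run of the reproduction may need an external deadline so it cannot block forever. The wrapper below is only a sketch under stated assumptions: `run_with_deadline` is a hypothetical name, and it relies on the GNU coreutils `timeout` command being installed. It does not fix the Flush() behavior; it only bounds how long the shell waits.

```shell
# Run a command, but give up after the given number of seconds.
# Assumption: GNU coreutils `timeout` is available; it exits with
# status 124 when the deadline is hit and the command is terminated.
run_with_deadline() {
  local seconds=$1
  shift
  local status=0
  timeout "$seconds" "$@" || status=$?
  if [ "$status" -eq 124 ]; then
    echo "command still running after ${seconds}s, terminated" >&2
  fi
  return "$status"
}
```

For instance, `run_with_deadline 30 dotnet run` would return 124 if the application were still blocked after 30 seconds.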

Checklist

Please provide the following information:

A complete (i.e. we can run it), minimal program demonstrating the problem. No need to supply a project file: see this Gist
Confluent.Kafka NuGet version: 1.1.0
Apache Kafka version: kafka_2.3.0

Client configuration.

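```
var producerConfig = new ProducerConfig
{
    // All three brokers of the local test cluster.
    BootstrapServers = "localhost:9092,localhost:9093,localhost:9094",
    // Maps to message.timeout.ms: local delivery timeout, in milliseconds.
    MessageTimeoutMs = 5000,
};
```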