Losing/dropping Messages in High Availability Producer
Description
I have a topic that needs to
- guarantee “at least once” sending, and
- have the highest possible availability.
I have set up an asynchronous producer in a loop that sends 2 million random messages to a topic “testpoc”. The topic “testpoc” has 3 partitions in my 3-broker cluster and a replication factor of 3.
This all works perfectly under normal circumstances; however, I need to test the contingency that a server goes down during a high load of incoming messages.
To create this scenario I start the producer, which takes a couple of minutes to send its 2 million messages, and during that time I stop whichever broker the quorum has elected as controller.
When I do, I would expect Kafka’s at-least-once guarantee to deliver all 2 million messages even though the controller goes down. However, each time I do this I see a loss of a hundred to several thousand messages.
I have tried making the producer effectively synchronous by blocking on each send like so:
var deliveryReport = producer.ProduceAsync("garytest011", key, msgToSend.Msgs.ToString()).Result;
however, this totally tanks my throughput, just as the documentation says it will.
What is the best way to guarantee “at least once” sending without killing my performance?
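A minimal sketch of one such pattern, assuming the 0.11.x Confluent.Kafka API used here (the topic name, message count, and Flush timeout are placeholders): produce asynchronously, record the keys of failed delivery reports for an application-level re-send, and flush once before exit instead of blocking on .Result for each message.

// Sketch only, assuming the 0.11.x Confluent.Kafka API used in this issue.
// Produce asynchronously, record failed delivery reports, and Flush once at
// shutdown instead of blocking on .Result for every message.
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Text;
using System.Threading.Tasks;
using Confluent.Kafka;
using Confluent.Kafka.Serialization;

class AtLeastOnceSketch
{
    static void Main()
    {
        var config = new Dictionary<string, object>
        {
            ["bootstrap.servers"] = "kafkabroker01:19092,kafkabroker02:19092,kafkabroker03:19092",
            ["retries"] = 5,
            ["retry.backoff.ms"] = 1000,
            ["default.topic.config"] = new Dictionary<string, object> { ["acks"] = -1 }
        };

        using (var producer = new Producer<string, string>(
            config, new StringSerializer(Encoding.UTF8), new StringSerializer(Encoding.UTF8)))
        {
            var failedKeys = new ConcurrentBag<string>(); // keys to re-produce at the application level
            var deliveries = new List<Task>();

            for (int i = 0; i < 100000; i++)
            {
                var key = "key" + i;
                deliveries.Add(producer.ProduceAsync("testtopic", key, "value-" + i)
                    .ContinueWith(t =>
                    {
                        if (t.Result.Error.HasError)
                            failedKeys.Add(key); // delivery failed after the client's retries were exhausted
                    }));
            }

            Task.WaitAll(deliveries.ToArray());       // wait for every delivery report
            producer.Flush(TimeSpan.FromSeconds(30)); // drain anything still queued in the client
            Console.WriteLine($"{failedKeys.Count} messages reported as failed; re-produce these for at-least-once");
        }
    }
}

The point of the sketch is that many sends stay in flight at once, while the delivery reports still identify exactly which messages need to be produced again.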
How to reproduce
I’m running a 3-broker cluster of ZooKeeper/Kafka on 3 separate VMs. All settings are left at their defaults. In a C# .NET Core 2.0 console app I create a basic producer like so:
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Text;
using Confluent.Kafka;
using Confluent.Kafka.Serialization;

class Program
{
    static void Main(string[] args)
    {
        Stopwatch stopwatch = new Stopwatch();
        stopwatch.Start();

        var kafkaConfig = new Dictionary<string, object>()
        {
            ["bootstrap.servers"] = "kafkabroker01:19092,kafkabroker02:19092,kafkabroker03:19092",
            ["retries"] = 5,
            ["retry.backoff.ms"] = 1000,
            ["client.id"] = "test-clientid",
            ["socket.nagle.disable"] = true,
            ["default.topic.config"] = new Dictionary<string, object>()
            {
                ["acks"] = -1, // "all"
            }
        };

        var producer = new Producer<string, string>(
            kafkaConfig, new StringSerializer(Encoding.UTF8), new StringSerializer(Encoding.UTF8));

        // Fire-and-forget produce loop; each delivery report is checked in a continuation.
        for (int index = 0; index < 2000000; index++)
        {
            var key = "key" + index;
            var msgToSend = new SimpleClass(key, index, 10);
            var deliveryReport = producer.ProduceAsync("testtopic", key, msgToSend.Msgs.ToString())
                .ContinueWith(task =>
                {
                    if (task.Result.Error.HasError)
                    {
                        Console.WriteLine($"Failed message: key={task.Result.Key} offset={task.Result.Offset}");
                    }
                });
        }

        Console.WriteLine("Sent 2 million messages to testtopic");
    }
}
public class SimpleClass
{
    public string Key { get; set; }
    public int Counter { get; set; }
    public Dictionary<string, string> Msgs { get; set; }
    public int[] IntArray { get; set; }

    public SimpleClass(string key, int counter, int numMsgs)
    {
        Key = key;
        Counter = counter;

        // Fill the payload with numMsgs random GUID entries.
        Msgs = new Dictionary<string, string>(numMsgs);
        for (int i = 0; i < numMsgs; i++)
        {
            var textGUID = Guid.NewGuid().ToString();
            Msgs.Add(textGUID, "GUID = " + textGUID);
        }

        // Pad the object with a random-length int array (1-49 elements) so message sizes vary.
        Random rnd = new Random();
        var size = rnd.Next(1, 50);
        IntArray = new int[size];
        for (int j = 0; j < size; j++)
        {
            IntArray[j] = rnd.Next();
        }
    }
}
While the process is running, I stop the Kafka broker currently elected as controller to simulate a server going down.
Because the loop runs 2 million times, there should be 2 million messages in the topic (or more, because of at-least-once redelivery).
I count the messages in the topic with the following command:
docker run --net=host --rm confluentinc/cp-kafka:3.3.1 kafka-run-class kafka.tools.GetOffsetShell --broker-list localhost:29092 --topic testtopic --time -1 --offsets 1 | awk -F ":" '{sum += $3} END {print sum}'
And I see anywhere from a hundred to several thousand fewer than 2 million.
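A sketch that may help narrow down where the deficit comes from (the succeeded/failed counters are illustrative additions wrapped around the ProduceAsync call in the repro above, not part of the original code): tally the delivery reports on the client and compare the success count with the broker-side total from GetOffsetShell.

// Sketch only: client-side tally of delivery reports (requires: using System.Threading;).
// "succeeded" and "failed" are hypothetical counters declared before the repro's
// produce loop; the ContinueWith below replaces the one inside the loop.
long succeeded = 0, failed = 0;

var deliveryReport = producer.ProduceAsync("testtopic", key, msgToSend.Msgs.ToString())
    .ContinueWith(task =>
    {
        if (task.Result.Error.HasError)
            Interlocked.Increment(ref failed);    // the client was told this send failed
        else
            Interlocked.Increment(ref succeeded); // the client got a positive delivery report
    });

// After the loop, once every delivery report has arrived:
// - if succeeded already falls short of 2 million, the loss was reported to the client;
// - if succeeded reaches 2 million but GetOffsetShell counts fewer, acknowledged messages went missing.
Console.WriteLine($"succeeded={succeeded} failed={failed}");

The two cases point in different directions: the first at client-side retry and queueing behaviour, the second at the acks/replication path on the brokers.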
Checklist
Please provide the following information:
- confluent-kafka-python and librdkafka version (confluent_kafka.version() and confluent_kafka.libversion()): 3.3.1 and 11.2
- Apache Kafka broker version: confluentinc/cp-kafka:3.3.1
- Client configuration: {...}
- Operating system: confluentinc/cp-kafka:3.3.1 Docker images on Ubuntu
- Provide client logs (with 'debug': '..' as necessary)
- Provide broker log excerpts
- Critical issue
Issue Analytics
- Created 6 years ago
- Comments: 42 (21 by maintainers)
We’re working on formalising and fixing the retry behaviour in librdkafka; we’ll keep you posted.
@edenhill @mhowlett You guys are awesome. Thank you very much for your stellar community involvement.