
Losing/dropping Messages in High Availability Producer

See original GitHub issue

Description

I have a topic that needs to

  • guarantee “at least once” sending, and

  • have the highest possible availability.

I have set up an asynchronous producer in a loop that sends 2 million random messages to a topic "testtopic". The topic "testtopic" has 3 partitions in my 3-broker cluster and a replication factor of 3. This all works perfectly under normal circumstances; however, I need to test the contingency that a server goes down during a high load of incoming messages.

To create this scenario I start the producer, which takes a couple of minutes to send its 2 million messages, and during that time I stop whichever broker the quorum has elected as controller. I would expect Kafka's at-least-once guarantee to deliver all 2 million messages even with the controller down. However, each time I do this I see a loss of anywhere from a hundred to several thousand messages.

I have tried making the producer effectively synchronous by blocking on each send, like so: producer.ProduceAsync("testtopic", key, msgToSend.Msgs.ToString()).Result; but this totally tanks my throughput, just as the documentation says it will. What is the best way to guarantee "at least once" delivery without killing my performance?
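One pattern that keeps sends asynchronous without blocking on each message is to collect the delivery-report tasks and await them in bounded batches, resending only the ones that report an error. This is a sketch, not from the original issue: BuildPayload and Resend are hypothetical helpers, the batch size is arbitrary, and it assumes it runs inside an async method against the 0.11.x Producer API shown in the reproduction below.

```csharp
// Sketch: bounded batches of in-flight delivery reports instead of per-message .Result.
var pending = new List<Task<Message<string, string>>>();
for (int index = 0; index < 2000000; index++)
{
    pending.Add(producer.ProduceAsync("testtopic", "key" + index, BuildPayload(index)));
    if (pending.Count == 10000)               // bound memory; 10000 is arbitrary
    {
        await Task.WhenAll(pending);          // only the tail of each batch actually waits
        foreach (var t in pending)
            if (t.Result.Error.HasError)
                Resend(t.Result);             // hypothetical re-enqueue hook
        pending.Clear();
    }
}
await Task.WhenAll(pending);                  // drain the final partial batch
producer.Flush(TimeSpan.FromSeconds(30));     // let librdkafka finish any internal retries
```

Because whole batches complete in parallel, throughput stays close to the fully asynchronous case while every delivery report is still inspected.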

How to reproduce

I’m running a 3-broker cluster of ZooKeeper/Kafka on 3 separate VMs. All settings are left at their defaults. In a C# .NET Core 2.0 console app I create a basic producer like so:

static void Main(string[] args)
        {
            Stopwatch stopwatch = new Stopwatch();
            stopwatch.Start();

            var kafkaConfig = new Dictionary<string, object>()
            {
                ["bootstrap.servers"] = "kafkabroker01:19092,kafkabroker02:19092,kafkabroker03:19092",
                ["retries"] = 5,
                ["retry.backoff.ms"] = 1000,
                ["client.id"] = "test-clientid",
                ["socket.nagle.disable"] = true,
                ["default.topic.config"] = new Dictionary<string, object>()
                {
                    ["acks"] = -1, // "all"
                }
            };

            var producer = new Producer<string, string>(kafkaConfig, new StringSerializer(Encoding.UTF8), new StringSerializer(Encoding.UTF8));

            for (int index = 0; index < 2000000; index++)
            {
                var key = "key" + index;
                var msgToSend = new SimpleClass(key, index, 10);
                var deliveryReport = producer.ProduceAsync("testtopic", key, msgToSend.Msgs.ToString())
                    .ContinueWith(task =>
                {
                    if (task.Result.Error.HasError)
                    {
                        Console.WriteLine($"Failed message: key={task.Result.Key} offset={task.Result.Offset} ");
                    }
                });
            }

            // Wait for all in-flight messages to be delivered (or fail) before exiting;
            // without this, messages still queued inside librdkafka are silently dropped.
            producer.Flush(TimeSpan.FromSeconds(60));

            stopwatch.Stop();
            Console.WriteLine($"Sent 2 million messages to testtopic in {stopwatch.Elapsed}");
        }

        public class SimpleClass
        {
            public string Key { get; set; }
            public int Counter { get; set; }
            public Dictionary<string, string> Msgs { get; set; }
            public int[] IntArray { get; set; }

            public SimpleClass(string key, int counter, int numMsgs)
            {
                Key = key;
                Counter = counter;
                Msgs = new Dictionary<string, string>(10);
                for (int i = 0; i < numMsgs; i++)
                {
                    var textGUID = Guid.NewGuid().ToString();
                    Msgs.Add(textGUID, "GUID = " + textGUID);
                }

                Random rnd = new Random();
                var size = rnd.Next(1, 50);
                IntArray = new int[size];
                for (int j = 0; j < size; j++)
                {
                    IntArray[j] = rnd.Next();
                }
            }
        }

While the process is running I stop the elected controller broker to simulate a server going down. Because the loop runs 2 million times, there should be 2 million messages in the topic (or more, given at-least-once redelivery). I count the messages in the topic with the following command:

docker run --net=host --rm confluentinc/cp-kafka:3.3.1 kafka-run-class kafka.tools.GetOffsetShell --broker-list localhost:29092 --topic testtopic --time -1 --offsets 1 | awk -F ":" '{sum += $3} END {print sum}'

And I see anywhere from a hundred to several thousand fewer than 2 million.
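For clarity on what that pipeline computes: GetOffsetShell prints one `topic:partition:latest-offset` line per partition, and the awk stage sums the third colon-separated field. A minimal stand-alone illustration of just the summation, with made-up per-partition offsets (no broker needed):

```shell
# Simulated GetOffsetShell output: topic:partition:latest-offset, one line per partition.
printf 'testtopic:0:666000\ntesttopic:1:667123\ntesttopic:2:665877\n' \
  | awk -F ":" '{sum += $3} END {print sum}'
# With these sample values the partitions sum to 1999000, i.e. 1000 short of 2 million.
```

The same command works regardless of partition count, since awk simply accumulates across every line it receives.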

Checklist

Please provide the following information:

  • Client (confluent-kafka-dotnet) and librdkafka version: 3.3.1 and 11.2
  • Apache Kafka broker version: confluentinc/cp-kafka:3.3.1
  • Client configuration: {...}
  • Operating system: Kafka OS: confluentinc/cp-kafka:3.3.1 docker images on Ubuntu OS
  • Provide client logs (with 'debug': '..' as necessary)
  • Provide broker log excerpts
  • Critical issue

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 42 (21 by maintainers)

Top GitHub Comments

3 reactions
edenhill commented, Dec 12, 2017

We’re working on formalising and fixing the retry behaviour in librdkafka, we’ll keep you posted.

1 reaction
gazareid commented, Dec 13, 2017

@edenhill @mhowlett You guys are awesome. Thank you very much for your stellar community involvement.
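For context on how this thread resolved: the retry formalisation the maintainer comment above refers to later shipped as the idempotent producer in librdkafka v1.0 and up, which has the broker deduplicate producer retries and so closes exactly this loss/duplication window; it was not available when this issue was filed. A hedged sketch of the relevant settings, in the same dictionary style as the reproduction code above (option names are librdkafka configuration properties, not from the original issue):

```csharp
var kafkaConfig = new Dictionary<string, object>()
{
    ["bootstrap.servers"] = "kafkabroker01:19092,kafkabroker02:19092,kafkabroker03:19092",
    ["enable.idempotence"] = true, // librdkafka >= 1.0: broker deduplicates producer retries
    ["acks"] = "all",              // required by (and enforced with) idempotence
};
```

With idempotence enabled, retries after a broker failover no longer risk gaps or duplicates, so the asynchronous producer can be used as-is.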

