Losing/dropping Messages in High Availability Producer
Description
I have a topic that needs to
- guarantee “at least once” sending, and
- have the highest possible availability.
I have set up an asynchronous producer in a loop that sends 2 million random messages to a topic “testpoc”. The topic “testpoc” has 3 partitions in my 3-broker cluster and a replication factor of 3.
This all works perfectly under normal circumstances; however, I need to test the contingency that a server goes down during a high load of incoming messages.
To create this scenario I start the producer, which takes a couple of minutes to send its 2 million messages, and during that time I stop whichever broker the quorum has elected as controller.
When I do, I would expect Kafka’s at-least-once guarantee to deliver all 2 million messages even though the controller goes down. However, each time I do this I see a loss of a hundred to several thousand messages.
I have tried making the producer effectively synchronous by blocking on each send like so:
var deliveryReport = producer.ProduceAsync("garytest011", key, msgToSend.Msgs.ToString()).Result;
however, this totally tanks my throughput, just as the documentation says it will.
What is the best way to guarantee “at least once” sending without killing my performance?
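A minimal sketch of one such pattern, assuming the 0.11.x Confluent.Kafka API used here (the topic name, message count, and Flush timeout are placeholders): produce asynchronously, record the keys of failed delivery reports for an application-level re-send, and flush once before exit instead of blocking on .Result for each message.

// Sketch only, assuming the 0.11.x Confluent.Kafka API used in this issue.
// Produce asynchronously, record failed delivery reports, and Flush once at
// shutdown instead of blocking on .Result for every message.
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Text;
using System.Threading.Tasks;
using Confluent.Kafka;
using Confluent.Kafka.Serialization;

class AtLeastOnceSketch
{
    static void Main()
    {
        var config = new Dictionary<string, object>
        {
            ["bootstrap.servers"] = "kafkabroker01:19092,kafkabroker02:19092,kafkabroker03:19092",
            ["retries"] = 5,
            ["retry.backoff.ms"] = 1000,
            ["default.topic.config"] = new Dictionary<string, object> { ["acks"] = -1 }
        };

        using (var producer = new Producer<string, string>(
            config, new StringSerializer(Encoding.UTF8), new StringSerializer(Encoding.UTF8)))
        {
            var failedKeys = new ConcurrentBag<string>(); // keys to re-produce at the application level
            var deliveries = new List<Task>();

            for (int i = 0; i < 100000; i++)
            {
                var key = "key" + i;
                deliveries.Add(producer.ProduceAsync("testtopic", key, "value-" + i)
                    .ContinueWith(t =>
                    {
                        if (t.Result.Error.HasError)
                            failedKeys.Add(key); // delivery failed after the client's retries were exhausted
                    }));
            }

            Task.WaitAll(deliveries.ToArray());       // wait for every delivery report
            producer.Flush(TimeSpan.FromSeconds(30)); // drain anything still queued in the client
            Console.WriteLine($"{failedKeys.Count} messages reported as failed; re-produce these for at-least-once");
        }
    }
}

The point of the sketch is that many sends stay in flight at once, while the delivery reports still identify exactly which messages need to be produced again.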
How to reproduce
I’m running a 3-broker cluster of ZooKeeper/Kafka on 3 separate VMs. All settings are left at their defaults. In a C# .NET Core 2.0 console app I create a basic producer like so:
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Text;
using Confluent.Kafka;
using Confluent.Kafka.Serialization;

class Program
{
    static void Main(string[] args)
    {
        Stopwatch stopwatch = new Stopwatch();
        stopwatch.Start();

        var kafkaConfig = new Dictionary<string, object>()
        {
            ["bootstrap.servers"] = "kafkabroker01:19092,kafkabroker02:19092,kafkabroker03:19092",
            ["retries"] = 5,
            ["retry.backoff.ms"] = 1000,
            ["client.id"] = "test-clientid",
            ["socket.nagle.disable"] = true,
            ["default.topic.config"] = new Dictionary<string, object>()
            {
                ["acks"] = -1, // "all"
            }
        };

        var producer = new Producer<string, string>(
            kafkaConfig, new StringSerializer(Encoding.UTF8), new StringSerializer(Encoding.UTF8));

        // Fire-and-forget produce loop; each delivery report is checked in a continuation.
        for (int index = 0; index < 2000000; index++)
        {
            var key = "key" + index;
            var msgToSend = new SimpleClass(key, index, 10);
            var deliveryReport = producer.ProduceAsync("testtopic", key, msgToSend.Msgs.ToString())
                .ContinueWith(task =>
                {
                    if (task.Result.Error.HasError)
                    {
                        Console.WriteLine($"Failed message: key={task.Result.Key} offset={task.Result.Offset}");
                    }
                });
        }

        Console.WriteLine("Sent 2 million messages to testtopic");
    }
}
public class SimpleClass
{
    public string Key { get; set; }
    public int Counter { get; set; }
    public Dictionary<string, string> Msgs { get; set; }
    public int[] IntArray { get; set; }

    public SimpleClass(string key, int counter, int numMsgs)
    {
        Key = key;
        Counter = counter;

        // Fill the payload with numMsgs random GUID entries.
        Msgs = new Dictionary<string, string>(numMsgs);
        for (int i = 0; i < numMsgs; i++)
        {
            var textGUID = Guid.NewGuid().ToString();
            Msgs.Add(textGUID, "GUID = " + textGUID);
        }

        // Pad the object with a random-length int array (1-49 elements) so message sizes vary.
        Random rnd = new Random();
        var size = rnd.Next(1, 50);
        IntArray = new int[size];
        for (int j = 0; j < size; j++)
        {
            IntArray[j] = rnd.Next();
        }
    }
}
While the process is running, I stop the Kafka broker currently elected as controller to simulate a server going down.
Because the loop runs 2 million times, there should be 2 million messages in the topic (or more, because of at-least-once redelivery).
I count the messages in the topic with the following command:
docker run --net=host --rm confluentinc/cp-kafka:3.3.1 kafka-run-class kafka.tools.GetOffsetShell --broker-list localhost:29092 --topic testtopic --time -1 --offsets 1 | awk -F ":" '{sum += $3} END {print sum}'
And I see anywhere from a hundred to several thousand fewer than 2 million.
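A sketch that may help narrow down where the deficit comes from (the succeeded/failed counters are illustrative additions wrapped around the ProduceAsync call in the repro above, not part of the original code): tally the delivery reports on the client and compare the success count with the broker-side total from GetOffsetShell.

// Sketch only: client-side tally of delivery reports (requires: using System.Threading;).
// "succeeded" and "failed" are hypothetical counters declared before the repro's
// produce loop; the ContinueWith below replaces the one inside the loop.
long succeeded = 0, failed = 0;

var deliveryReport = producer.ProduceAsync("testtopic", key, msgToSend.Msgs.ToString())
    .ContinueWith(task =>
    {
        if (task.Result.Error.HasError)
            Interlocked.Increment(ref failed);    // the client was told this send failed
        else
            Interlocked.Increment(ref succeeded); // the client got a positive delivery report
    });

// After the loop, once every delivery report has arrived:
// - if succeeded already falls short of 2 million, the loss was reported to the client;
// - if succeeded reaches 2 million but GetOffsetShell counts fewer, acknowledged messages went missing.
Console.WriteLine($"succeeded={succeeded} failed={failed}");

The two cases point in different directions: the first at client-side retry and queueing behaviour, the second at the acks/replication path on the brokers.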
Checklist
Please provide the following information:
- confluent-kafka-python and librdkafka version (confluent_kafka.version() and confluent_kafka.libversion()): 3.3.1 and 11.2
- Apache Kafka broker version: confluentinc/cp-kafka:3.3.1
- Client configuration: {...}
- Operating system: confluentinc/cp-kafka:3.3.1 Docker images on Ubuntu
- Provide client logs (with 'debug': '..' as necessary)
- Provide broker log excerpts
- Critical issue
Issue Analytics
- Created 6 years ago
- Comments: 42 (21 by maintainers)
We’re working on formalising and fixing the retry behaviour in librdkafka; we’ll keep you posted.
@edenhill @mhowlett You guys are awesome. Thank you very much for your stellar community involvement.