Producer backpressure semantics are not async-friendly
Description
I am thinking about how to implement a producer correctly in a high-load async environment. The producer has 2 main modes, callback based (IDeliveryHandler) and Task based. Both have a blockIfQueueFull parameter.
Let's start with Task mode. If I do the following:
foreach (var n in batch) {
    var message = await producer.ProduceAsync(topic, null, n, true).
        ContinueWith(t => Interlocked.Increment(ref counter));
}
then I effectively produce one message per batch, which is a no-go. So I need to accumulate messages and await on the whole batch, which leads to something like this:
var stream = Enumerable.Range(1, batchSize).Select(i => i.ToString());
var tasks = stream.Select(async msg => await producer.ProduceAsync(topic, null, msg, true).
    ContinueWith(t => Interlocked.Increment(ref counter)));
await Task.WhenAll(tasks);
But this is problematic too. If I have an infinite stream of source messages, then I do not know when my "tasks" will end (never), so I'll have to do LINQ batching.
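For concreteness, a sketch of what that batching could look like (assuming a hypothetical InfiniteSource() enumerable and the same 0.11.x ProduceAsync overload as above; this would need to run inside an async method):

var buffer = new List<Task>(batchSize);
foreach (var msg in InfiniteSource())   // hypothetical infinite IEnumerable<string>
{
    buffer.Add(producer.ProduceAsync(topic, null, msg, true)
        .ContinueWith(t => Interlocked.Increment(ref counter)));

    if (buffer.Count == batchSize)
    {
        await Task.WhenAll(buffer);     // wait for the whole chunk to be delivered
        buffer.Clear();
    }
}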
Another worry is that ProduceAsync will block, but I do not quite understand how that interplays with the await portion of await producer.ProduceAsync. Is there a chance that the task scheduler will see a blocked task and keep scheduling more and more threads, because from the scheduler's perspective it is just a long-running task and it will try to run as many parallel tasks as possible, without being aware that this blocked task is actually backpressure? Could that lead to thread-pool exhaustion?
The bottom line is, I do not see how to use the Task-based API, because we need to await two different kinds of async flows: backpressure and delivery. And sync and async do not mix well. My intuition is that the librdkafka blocking call should be converted into a task which can be awaited. Does this sound reasonable?
Let's consider the delivery handler option. It is more appealing to me because generating a task per message and then awaiting every one of them sounds sub-optimal. Start with this implementation:
class DeliveryHandler : IDeliveryHandler<Null, string>
{
    public int counter = 0;

    public bool MarshalData => true;

    public void HandleDeliveryReport(Message<Null, string> message)
    {
        Interlocked.Increment(ref counter);
    }
}
// ...
var deliveryHandler = new DeliveryHandler();
Task.Run(() =>
{
    using (var producer = new Producer<Null, string>(config, null, new StringSerializer(Encoding.UTF8)))
    {
        var stream = Enumerable.Range(1, batchSize).Select(i => i.ToString());
        foreach (var msg in stream)
        {
            producer.ProduceAsync(topic, null, msg, true, deliveryHandler);
        }
    }
}).Wait();
Now producing is synchronous and the queue-full blocking call is propagated to the caller naturally. But there are problems: if you are operating in an async-heavy environment, you still want buffer overflow to be exposed as an awaitable entity, so you have to do more work to convert it to async. And the second problem is that now I have no way of knowing about delivery failures. How do I handle failed deliveries in the delivery handler?
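One possible approach (a sketch only, assuming the 0.11.x delivery report exposes the per-message result via Message<TKey, TValue>.Error) is to inspect that error inside the handler:

class DeliveryHandler : IDeliveryHandler<Null, string>
{
    public int delivered = 0;
    public int failed = 0;

    public bool MarshalData => true;

    public void HandleDeliveryReport(Message<Null, string> message)
    {
        // The delivery report is the message itself; a code other than NoError
        // means librdkafka (or the broker) failed to deliver it.
        if (message.Error.Code != ErrorCode.NoError)
            Interlocked.Increment(ref failed);    // or log / retry / dead-letter
        else
            Interlocked.Increment(ref delivered);
    }
}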
Checklist
Please provide the following information:
- Confluent.Kafka nuget version: 0.11.3
I'm trying to figure out why this code hangs (deadlocks). Do I have to use Thread.Sleep?
P.S. I can reproduce this problem only on Linux.
Update: This happens in xunit tests with AsyncTestSyncContext and ThreadPoolScheduler.
Update 2: After investigation I found that there are a lot of error messages: Kafka producer error: "rdkafka#producer-10" Error { Code: Local_Transport, IsFatal: False, Reason: "localhost:29099/bootstrap: Connect to ipv4#127.0.0.1:29099 failed: Connection refused (after 0ms in state CONNECT)", IsError: True, IsLocalError: True, IsBrokerError: False }
Update 3: It was a problem with the latest version of the confluentinc/cp-kafka image. 😃
Thanks again @vchekan and @robinreeves for your thoughts here. I've had a chance to think this through now, and my thoughts are below.

I now believe that both ProduceAsync and BeginProduce (as it's now called in 1.0) should never block - about to make this change on the 1.0-experimental branch. They should always throw a KafkaException if invoked while the librdkafka message queue is full (error code Local_QueueFull) - the user can catch this and wait in a synchronous or asynchronous way themselves if they want.

The blocking behavior was intended as a convenience, but I don't think that it is. First, there is the great discussion from you wondering about the interplay with the scheduler. Second, you wouldn't want to use the blocking behavior in, say, a web request handler, since you don't want to block the thread. Third, it's not typical, so even if users are aware of the behavior (note - it's no longer a parameter on the method, it's a config parameter, which is more hidden), they need to think through an unusual pattern (as you are doing).
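For illustration, a rough sketch of that user-side handling with the Task-based API (assuming the 1.0-style types; exact names may differ on the experimental branch):

async Task ProduceWithRetryAsync(IProducer<Null, string> producer, string topic, string value)
{
    while (true)
    {
        try
        {
            await producer.ProduceAsync(topic, new Message<Null, string> { Value = value });
            return;
        }
        catch (KafkaException e) when (e.Error.Code == ErrorCode.Local_QueueFull)
        {
            // The full local queue is the backpressure signal: yield the thread
            // instead of blocking it, then try again.
            await Task.Delay(TimeSpan.FromMilliseconds(100));
        }
    }
}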
I don't think the librdkafka blocking call should be converted into a Task either, since the point of having limits on the librdkafka queue size is (presumably?) related to limiting memory usage. If we want to be able to have more simultaneous stuff in flight (which is effectively what awaiting a blocked call would achieve), just increase the limits in librdkafka instead.

I see the Task-based method as mostly useful in scenarios where you want to await each task separately, e.g. web requests where you want parallelism across many simultaneous requests. I think if you find yourself wanting to batch up results, it's a wrong fit. I think something along the lines of the below is how to best produce an infinite stream at high speed:
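A minimal sketch of that pattern (assuming the 1.0-style ProducerBuilder/BeginProduce API, with a full local queue surfacing as ErrorCode.Local_QueueFull; details may differ on the experimental branch):

void ProduceForever(ProducerConfig config, string topic)
{
    // Delivery results arrive via a callback, so no Task is allocated per message.
    Action<DeliveryReport<Null, string>> handler = r =>
    {
        if (r.Error.IsError)
            Console.WriteLine($"delivery failed: {r.Error.Reason}");
    };

    using (var producer = new ProducerBuilder<Null, string>(config).Build())
    {
        long i = 0;
        while (true)
        {
            var msg = new Message<Null, string> { Value = (i++).ToString() };

            // Retry this message until it fits in the local queue.
            while (true)
            {
                try
                {
                    producer.BeginProduce(topic, msg, handler);
                    break;
                }
                catch (KafkaException e) when (e.Error.Code == ErrorCode.Local_QueueFull)
                {
                    // Backpressure: wait for the queue to drain a little, then retry.
                    Thread.Sleep(100);
                }
            }
        }
    }
}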
Another thing to note is that the Task-based methods are less performant / consume more resources than the Action<DeliveryReport> methods, so if you care about performance BeginProduce is better. But the difference isn't that much.

I'm going to close this now since it's been open a long time, but feel free to comment further / re-open if you don't agree with the new proposed behavior when the librdkafka queue fills up.