Producer backpressure semantics are not async-friendly
Description
I am thinking about how to implement a producer correctly in a high-load async environment. The producer has 2 main modes, callback based (IDeliveryHandler) and Task based. Both have a blockIfQueueFull parameter.
Let's start with Task mode. If I do the following:
foreach (var n in batch) {
    var message = await producer.ProduceAsync(topic, null, n, true).
        ContinueWith(t => Interlocked.Increment(ref counter));
}
then I effectively produce one message per batch, which is a no-go. So I need to accumulate messages and await on the whole batch, which leads to something like this:
var stream = Enumerable.Range(1, batchSize).Select(i => i.ToString());
var tasks = stream.Select(async msg => await producer.ProduceAsync(topic, null, msg, true).
    ContinueWith(t => Interlocked.Increment(ref counter)));
await Task.WhenAll(tasks);
But this is problematic too. If I have an infinite stream of source messages, then I do not know when my "tasks" will end (never), so I'll have to do LINQ batching.
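For concreteness, a sketch of what that batching could look like (assuming a hypothetical InfiniteSource() enumerable and the same 0.11.x ProduceAsync overload as above; this would need to run inside an async method):

var buffer = new List<Task>(batchSize);
foreach (var msg in InfiniteSource())   // hypothetical infinite IEnumerable<string>
{
    buffer.Add(producer.ProduceAsync(topic, null, msg, true)
        .ContinueWith(t => Interlocked.Increment(ref counter)));

    if (buffer.Count == batchSize)
    {
        await Task.WhenAll(buffer);     // wait for the whole chunk to be delivered
        buffer.Clear();
    }
}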
Another worry is that ProduceAsync will block, but I do not quite understand how that interplays with the await portion of await producer.ProduceAsync. Is there a chance that the task scheduler will see a blocked task and keep scheduling more and more threads, because from the scheduler's perspective it is just a long-running task and it will try to run as many parallel tasks as possible, without being aware that this blocked task is actually backpressure? Could that lead to thread-pool exhaustion?
The bottom line is, I do not see how to use the Task-based API, because we need to await two different kinds of async flows: backpressure and delivery. And sync and async do not mix well. My intuition is that the librdkafka blocking call should be converted into a task which can be awaited. Does this sound reasonable?
Let's consider the delivery handler option. It is more appealing to me because generating a task per message and then awaiting every one of them sounds sub-optimal. Start with this implementation:
class DeliveryHandler : IDeliveryHandler<Null, string>
{
    public int counter = 0;

    public bool MarshalData => true;

    public void HandleDeliveryReport(Message<Null, string> message)
    {
        Interlocked.Increment(ref counter);
    }
}
// ...
var deliveryHandler = new DeliveryHandler();
Task.Run(() =>
{
    using (var producer = new Producer<Null, string>(config, null, new StringSerializer(Encoding.UTF8)))
    {
        var stream = Enumerable.Range(1, batchSize).Select(i => i.ToString());
        foreach (var msg in stream)
        {
            producer.ProduceAsync(topic, null, msg, true, deliveryHandler);
        }
    }
}).Wait();
Now producing is synchronous and the queue-full blocking call is propagated to the caller naturally. But there are problems: if you are operating in an async-heavy environment, you still want buffer overflow to be exposed as an awaitable entity, so you have to do more work to convert it to async. And the second problem is that now I have no way of knowing about delivery failures. How do I handle failed deliveries in the delivery handler?
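One possible approach (a sketch only, assuming the 0.11.x delivery report exposes the per-message result via Message<TKey, TValue>.Error) is to inspect that error inside the handler:

class DeliveryHandler : IDeliveryHandler<Null, string>
{
    public int delivered = 0;
    public int failed = 0;

    public bool MarshalData => true;

    public void HandleDeliveryReport(Message<Null, string> message)
    {
        // The delivery report is the message itself; a code other than NoError
        // means librdkafka (or the broker) failed to deliver it.
        if (message.Error.Code != ErrorCode.NoError)
            Interlocked.Increment(ref failed);    // or log / retry / dead-letter
        else
            Interlocked.Increment(ref delivered);
    }
}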
Checklist
Please provide the following information:
- Confluent.Kafka nuget version: 0.11.3
I'm trying to figure out why this code hangs (deadlocks). Do I have to use Thread.Sleep?
P.S. I can reproduce this problem only on Linux.
Update: This happens in xunit tests with AsyncTestSyncContext and ThreadPoolScheduler.
Update 2: After investigation I found that there are a lot of error messages: Kafka producer error: "rdkafka#producer-10" Error { Code: Local_Transport, IsFatal: False, Reason: "localhost:29099/bootstrap: Connect to ipv4#127.0.0.1:29099 failed: Connection refused (after 0ms in state CONNECT)", IsError: True, IsLocalError: True, IsBrokerError: False }
Update 3: It was a problem with the latest version of the confluentinc/cp-kafka image. 😃
Thanks again @vchekan and @robinreeves for your thoughts here. I've had a chance to think this through now, and my thoughts are below.

I now believe that both ProduceAsync and BeginProduce (as it's now called in 1.0) should never block - about to make this change on the 1.0-experimental branch. They should always throw a KafkaException if invoked while the librdkafka message queue is full (error code Local_QueueFull) - the user can catch this and wait in a synchronous or asynchronous way themselves if they want.

The blocking behavior was intended as a convenience, but I don't think that it is. First, there is the great discussion from you wondering about the interplay with the scheduler. Second, you wouldn't want to use the blocking behavior in, say, a web request handler, since you don't want to block the thread. Third, it's not typical, so even if users are aware of the behavior (note - it's no longer a parameter on the method, it's a config parameter, which is more hidden), they need to think through an unusual pattern (as you are doing).
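For illustration, a rough sketch of that user-side handling with the Task-based API (assuming the 1.0-style types; exact names may differ on the experimental branch):

async Task ProduceWithRetryAsync(IProducer<Null, string> producer, string topic, string value)
{
    while (true)
    {
        try
        {
            await producer.ProduceAsync(topic, new Message<Null, string> { Value = value });
            return;
        }
        catch (KafkaException e) when (e.Error.Code == ErrorCode.Local_QueueFull)
        {
            // The full local queue is the backpressure signal: yield the thread
            // instead of blocking it, then try again.
            await Task.Delay(TimeSpan.FromMilliseconds(100));
        }
    }
}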
I don't think the librdkafka blocking call should be converted into a Task either, since the point of having limits on the librdkafka queue size is (presumably?) related to limiting memory usage. If we want to be able to have more simultaneous stuff in flight (which is effectively what awaiting a blocked call would achieve), just increase the limits in librdkafka instead.

I see the Task-based method as mostly useful in scenarios where you want to await each task separately, e.g. web requests where you want parallelism across many simultaneous requests. I think if you find yourself wanting to batch up results, it's a wrong fit. I think something along the lines of the below is how to best produce an infinite stream at high speed:
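A minimal sketch of that pattern (assuming the 1.0-style ProducerBuilder/BeginProduce API, with a full local queue surfacing as ErrorCode.Local_QueueFull; details may differ on the experimental branch):

void ProduceForever(ProducerConfig config, string topic)
{
    // Delivery results arrive via a callback, so no Task is allocated per message.
    Action<DeliveryReport<Null, string>> handler = r =>
    {
        if (r.Error.IsError)
            Console.WriteLine($"delivery failed: {r.Error.Reason}");
    };

    using (var producer = new ProducerBuilder<Null, string>(config).Build())
    {
        long i = 0;
        while (true)
        {
            var msg = new Message<Null, string> { Value = (i++).ToString() };

            // Retry this message until it fits in the local queue.
            while (true)
            {
                try
                {
                    producer.BeginProduce(topic, msg, handler);
                    break;
                }
                catch (KafkaException e) when (e.Error.Code == ErrorCode.Local_QueueFull)
                {
                    // Backpressure: wait for the queue to drain a little, then retry.
                    Thread.Sleep(100);
                }
            }
        }
    }
}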
Another thing to note is that the Task-based methods are less performant / consume more resources than the Action<DeliveryReport> methods, so if you care about performance BeginProduce is better. But the difference isn't that much.

I'm going to close this now since it's been open a long time, but feel free to comment further / re-open if you don't agree with the new proposed behavior when the librdkafka queue fills up.