Idle downstream actors until confirmation is sent
Looking at the KafkaConsumerActor API, I understand that the only way to process N message batches from the same topic concurrently is to create N KafkaConsumerActors with the same groupId, each with its own corresponding downstream actor. From the KafkaConsumerActor docs:
Before the actor continues pulling more data from Kafka, the receiver of the data must confirm the batches by sending back a [[KafkaConsumerActor.Confirm]] message that contains the offsets from the received batch.
If we want to achieve at-least-once semantics, we must wait until each batch's processing succeeds before sending the Confirm message with the offsets.
So, although the downstream actor is not blocked (assuming we've done things right and process the batch asynchronously, e.g. in a Future), it won't receive more batches until the Confirm(offsets) is sent, and to me this is equivalent to blocking the actor.
Suppose the first batch contains the messages
- first message
- second message
- third message
I write these messages to MongoDB in a Future and send a confirmation message to the KafkaConsumerActor when that Future succeeds.
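This flow can be sketched roughly as follows. This is only my reading of the scala-kafka-client API, and writeToMongo is a placeholder for the actual MongoDB call:

```scala
import akka.actor.Actor
import cakesolutions.kafka.akka.{ConsumerRecords, KafkaConsumerActor}
import scala.concurrent.Future
import scala.util.Success

// Sketch of a downstream receiver actor. `writeToMongo` is a
// hypothetical asynchronous persistence call.
class ReceiverActor extends Actor {
  import context.dispatcher

  private val extractor = ConsumerRecords.extractor[String, String]

  def receive: Receive = {
    case extractor(records) =>
      val consumer = sender() // capture before the Future callback runs
      writeToMongo(records.values).onComplete {
        case Success(_) =>
          // Confirm (and commit) only after the batch is safely persisted,
          // which gives the at-least-once semantics described above.
          consumer ! KafkaConsumerActor.Confirm(records.offsets, commit = true)
        case _ =>
          // Leave the batch unconfirmed; it should be redelivered after
          // the unconfirmed timeout.
      }
  }

  private def writeToMongo(values: Seq[String]): Future[Unit] = ???
}
```

The point is that between receiving the batch and the Future completing, this actor sits idle, which is the situation I'm describing.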
If new messages come
- fourth message
- fifth message
I would like to be able to process these new messages even if the first write to Mongo hasn't finished.
Until Mongo confirms the writes, the downstream actor will be in an idle state, so why couldn't the KafkaConsumerActor keep sending new batches?
I understand that this works as a mechanism to avoid overwhelming the downstream actor:
This mechanism allows the receiver to control the maximum rate of messages it will receive.
But couldn't we configure a threshold of batches that can be sent without confirmation? Something like splitting the unconfirmed state into unconfirmedThresholdUnreached and unconfirmedThresholdReached.
If creating N KafkaConsumerActors + downstream actors (with the same groupId) is the way to go, must I choose a fixed number of actors?
I hope I’ve made myself clear. Thanks!
Issue Analytics
- State:
- Created 7 years ago
- Comments:5 (3 by maintainers)
Top GitHub Comments
Great discussion. I think the multiple-batch idea proposed by @gabrielgiussi would certainly improve the performance of a single stream in the scenario described. There is a clear latency introduced by awaiting the downstream system's (Mongo) confirmation of the batch before requesting the next one.
Providing a capability to process multiple batches concurrently does introduce some complexity, however, which is described in the KafkaConsumer javadoc (https://kafka.apache.org/0100/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html) in the section "Decouple Consumption and Processing". Specifically, it becomes tricky to keep the commit position consistent, and ordering guarantees are lost.
It would seem more reasonable to me to go with the original suggestion of using multiple KafkaConsumerActors with the same “groupId” to achieve the performance optimisation (downstream batch parallelism), rather than add the additional configuration and implementation complexity to the KafkaConsumerActor. Since the ordering of the stream could not easily be guaranteed using the multiple batch technique, it makes sense to break up the stream into multiple ones and lean on the group capabilities already provided by the underlying driver.
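For reference, a rough sketch of that suggestion, as I understand the API. The topic name, groupId, and broker address are made up, and N is bounded by the topic's partition count:

```scala
import akka.actor.{Actor, ActorSystem, Props}
import cakesolutions.kafka.KafkaConsumer
import cakesolutions.kafka.akka.{ConsumerRecords, KafkaConsumerActor}
import org.apache.kafka.common.serialization.StringDeserializer

// Minimal receiver that confirms once it has handled the batch.
class Receiver extends Actor {
  private val extractor = ConsumerRecords.extractor[String, String]
  def receive: Receive = {
    case extractor(records) =>
      // process records.values here, then confirm and commit
      sender() ! KafkaConsumerActor.Confirm(records.offsets, commit = true)
  }
}

object MultipleConsumers extends App {
  val system = ActorSystem("consumers")
  val N = 4 // one stream per pair; Kafka splits the partitions among them

  (1 to N).foreach { i =>
    val receiver = system.actorOf(Props(new Receiver), s"receiver-$i")
    val consumerConf = KafkaConsumer.Conf(
      new StringDeserializer, new StringDeserializer,
      bootstrapServers = "localhost:9092",
      groupId = "my-group" // same groupId for all N consumers
    )
    val consumer = system.actorOf(
      KafkaConsumerActor.props(consumerConf, KafkaConsumerActor.Conf(), receiver),
      s"consumer-$i"
    )
    consumer ! KafkaConsumerActor.Subscribe.AutoPartition(Seq("my-topic"))
  }
}
```

Since each KafkaConsumerActor has at most one unconfirmed batch in flight, the downstream writes of the N streams overlap, while ordering is still preserved within each partition.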
Good point @simonsouter.
So, the next thing I've got to do is choose between using a fixed number of KafkaConsumerActors created at application startup, or creating KafkaConsumerActors as needed. The latter requires some mechanism that lets me know when my ReceiverActors are overwhelmed (could KafkaConsumerActors constantly entering the bufferFull state act as a signal of that?). I'm thinking about elasticity here, e.g. being capable of processing a peak load of Kafka messages, but maybe I'm going too far.