[FEATURE REQ] Better support for batch processing in Event Hubs SDK
Library or service name. [Azure.Messaging.EventHubs]
Is your feature request related to a problem? Please describe. (This issue is prompted by the following Twitter thread)
Batch-processing when receiving events in the new EventHubs SDK is much, much more difficult than in the previous SDK. When switching over I had to re-implement a lot of the functionality you used to get out of the box.
My batching workflow.
Just to sum up what sort of workflow I’m working with in regards to batching.
- I would like to process messages in batches of n.
- If n messages haven’t been received within 1 second, I’d like to batch process the events that are already lying around. (We have some data sources that are intermittent, and we can’t just leave messages around because new ones aren’t coming in.)
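The two rules above can be sketched as a small flush policy. This is only an illustration, not part of the Event Hubs SDK: `SizeOrTimeBatcher<T>`, `Add` and `Tick` are hypothetical names, the timer is assumed to start when the first item of a batch is buffered, and one instance per partition with sequential calls is assumed (the class is not thread-safe).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not an SDK type: flushes when maxSize items are buffered
// or when the oldest buffered item has waited at least maxWait.
public sealed class SizeOrTimeBatcher<T>
{
    private readonly int _maxSize;
    private readonly TimeSpan _maxWait;
    private readonly List<T> _buffer = new List<T>();
    private DateTimeOffset _firstItemAt;

    public SizeOrTimeBatcher(int maxSize, TimeSpan maxWait)
    {
        if (maxSize <= 0) throw new ArgumentOutOfRangeException(nameof(maxSize));
        _maxSize = maxSize;
        _maxWait = maxWait;
    }

    // Buffers an item. Returns the full batch once the size limit is reached,
    // otherwise an empty list.
    public IReadOnlyList<T> Add(T item, DateTimeOffset now)
    {
        if (_buffer.Count == 0) _firstItemAt = now;
        _buffer.Add(item);
        return _buffer.Count >= _maxSize ? Drain() : Array.Empty<T>();
    }

    // Intended to be called when no event arrived (for example on a heartbeat).
    // Returns the pending items if the oldest one has waited long enough.
    public IReadOnlyList<T> Tick(DateTimeOffset now)
    {
        bool timedOut = _buffer.Count > 0 && now - _firstItemAt >= _maxWait;
        return timedOut ? Drain() : Array.Empty<T>();
    }

    private IReadOnlyList<T> Drain()
    {
        T[] batch = _buffer.ToArray();
        _buffer.Clear();
        return batch;
    }
}
```

Passing the current time in as a parameter, rather than reading the clock inside the class, keeps the policy easy to test.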
Following the example here for batch processing led me to the following issues.
Problem: Managing Partition State
We have to manually manage state per-partition. You used to be able to have the SDK do that for you for batching cases. In the example listed above it seems pretty easy - you just add a ConcurrentDictionary with the partition id and whatever you want to store right? However let’s take an example where it’s not so simple.
- Processor 1 acquires partition 1 and batches up 10 messages. This is below the processing limit.
- Processor 1 loses partition 1. The messages are still batched.
- At some later point in time, Processor 1 reacquires partition 1. The old messages that are left will then be reprocessed.
To avoid this sort of circumstance, you’ll need to listen to all of the PartitionClosing events and ensure that you synchronize your state with them.
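One way to keep that state in sync is to give the per-partition buffer an explicit discard operation and call it whenever ownership of a partition is lost. The sketch below is an assumption-laden illustration: `PartitionBuffers<T>` and its members are hypothetical names, not SDK types.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

// Hypothetical per-partition buffer, not an SDK type.
public sealed class PartitionBuffers<T>
{
    private readonly ConcurrentDictionary<string, List<T>> _buffers =
        new ConcurrentDictionary<string, List<T>>();

    // Appends an item to the buffer of the given partition.
    public void Add(string partitionId, T item)
    {
        List<T> list = _buffers.GetOrAdd(partitionId, _ => new List<T>());
        lock (list) { list.Add(item); }
    }

    // Returns and clears everything buffered for the partition.
    public IReadOnlyList<T> Drain(string partitionId)
    {
        if (!_buffers.TryGetValue(partitionId, out List<T> list))
        {
            return Array.Empty<T>();
        }
        lock (list)
        {
            T[] batch = list.ToArray();
            list.Clear();
            return batch;
        }
    }

    // Drops the partition's buffer without processing it, so that events batched
    // before ownership was lost are not replayed if the partition is reacquired.
    // Returns the number of discarded items.
    public int Discard(string partitionId)
    {
        if (!_buffers.TryRemove(partitionId, out List<T> list)) return 0;
        lock (list) { return list.Count; }
    }
}
```

With `EventProcessorClient`, `Discard` could be called from the `PartitionClosingAsync` handler using the partition id carried by the event args. Discarding is safe only if the checkpoint was not advanced past the dropped events; the next owner then re-reads them from the last checkpoint.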
Problem: Heartbeat messages do not allow checkpointing.
Next up - the case where I want to process my messages after n seconds without any activity. Luckily there’s the heartbeat message for this.
Unfortunately the heartbeat message doesn’t allow us to do e.g. checkpointing or read lastEnqueuedTime, so I have to build up a structure that forces me to retrieve that from the last message I’ve enqueued.
```csharp
private async Task ProcessHeartbeatEvent(IProcessEventArgs args, string partitionId, ICheckpointer checkpointer)
{
    var data = _partitionedMessageBatcher.Drain(partitionId);

    // If there are no messages buffered for the partition, we've received multiple
    // heartbeat messages in a row, and there's no need to pass an empty list of
    // messages any further.
    if (data.Count == 0)
    {
        return;
    }

    // The updateCheckpoint delegate we get from Azure Event Hubs is coupled to the event.
    // When batching we always provide the updateCheckpoint from the latest event, except
    // when we receive a heartbeat message, in which case we use the one from the last "real" event.
    var lastRealEvent = data.Last();
    Func<CancellationToken, Task> lastUpdateCheckpointAsync = lastRealEvent.UpdateCheckpointAsync;

    // Use the partition context from the last event as well, as the heartbeat message
    // doesn't have the correct properties such as LastEnqueuedTime.
    var lastPartitionContext = lastRealEvent.Partition;

    var eventData = data
        .Select(batched => // renamed from "args" so it does not clash with the method parameter
        {
            // Args with null data shouldn't make it into the batcher.
            Debug.Assert(batched.Data != null, "batched.Data != null");
            return batched.Data!;
        })
        .ToList();

    Log.Debug("Flushing {messageCount} messages due to heartbeat message", data.Count);

    var receivedEventDataBatch =
        new ReceivedEventDataBatch(lastPartitionContext, eventData, lastUpdateCheckpointAsync);
    await _processEvent(receivedEventDataBatch, checkpointer);
}
```
Perhaps this is because my codebase is shaped by the old SDK where checkpointing was done on a batch basis rather than as a function provided by each event received.
Summing up
- Batching is harder than it needs to be. I imagine this use-case is common and I would prefer if something was provided that helped you do batching, similar to the old-style SDK.
- However I think in lieu of that, something that will help you manage partition state would be nice. I’m not quite sure how that would work however.
- I would also like if you were able to call UpdateCheckpointAsync on a heartbeat message, and that would then checkpoint all the previous messages.
- If nothing else, a more involved example in the documentation would be nice.
Top GitHub Comments
Hi folks,
Apologies for the lack of updates. Thanks in no small part to the discussion on this issue, we were able to prove the need for a better story around extending the processor and batch support. Starting with our next release, v5.7.0-beta.5, we’ve made the following improvements:
- The Azure.Messaging.EventHubs package now defines a CheckpointStore type to normalize processor storage operations.
- The Azure.Messaging.EventHubs package includes a PluggableCheckpointStoreEventProcessor<T> that can be extended with your processing logic without the need to implement storage operations.
- The Blob Storage implementation used by the EventProcessorClient is now public in the Azure.Messaging.EventHubs.Processor package as BlobCheckpointStore, and can be used when extending processor types.
- All event processor types now expose a protected UpdateCheckpoint member that types extending them can call. This new method does not require an EventData instance to create the checkpoint, only an offset.
More details can be found in this sample.
Thanks for the quick reply! I see… the problem with using the EventProcessor<T> is that it requires rewriting or copying a lot of code which already exists in EventProcessorClient. However, after looking at the code, it seems that using the BlobsCheckpointStore would make it much easier. But I was disappointed to find that the BlobsCheckpointStore class is marked as internal. Could you consider making it public so that it can be consumed directly?