Support a MinBatchSize property in the WebJobs Event Hubs extension
Library or service name: `Microsoft.Azure.WebJobs.Extensions.EventHubs`
Is your feature request related to a problem? Please describe.
I have a scenario where I would like to aggregate events over time and then handle them as a batch. The size of this batch must be fairly large for my downstream service’s optimizations; for example, let’s say 10k events.
Using the Event Hubs SDK directly (`Azure.Messaging.EventHubs`), I could write a processor that aggregates incoming data into buckets (one bucket per partition) and, when a bucket hits a certain event-count threshold (or when enough time has passed since it was last updated), “flushes” the bucket and then updates the checkpoint for that particular partition. This way I am never at risk of data loss. The same approach cannot be taken with the current WebJobs SDK, because it automatically updates the checkpoint after every X batches are “processed”, so if my process crashes before it can flush a bucket, all the data in that bucket is lost.
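For illustration, a minimal sketch of that bucket-per-partition approach with `EventProcessorClient` might look like the following. The connection strings, `FlushThreshold`, and `FlushToDownstreamAsync` are placeholders, and the time-based flush mentioned above is omitted for brevity:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Azure.Messaging.EventHubs;
using Azure.Messaging.EventHubs.Consumer;
using Azure.Messaging.EventHubs.Processor;
using Azure.Storage.Blobs;

class BucketingProcessor
{
    // Stand-in for the proposed MinBatchSize: flush once a partition has this many events.
    const int FlushThreshold = 10_000;

    // One bucket per partition. The processor invokes the event handler for a given
    // partition sequentially, so only the dictionary itself needs to be thread-safe.
    static readonly ConcurrentDictionary<string, List<EventData>> Buckets = new();

    static async Task Main()
    {
        var checkpointStore = new BlobContainerClient("<storage-connection-string>", "checkpoints");
        var processor = new EventProcessorClient(
            checkpointStore,
            EventHubConsumerClient.DefaultConsumerGroupName,
            "<event-hub-connection-string>",
            "<event-hub-name>");

        processor.ProcessEventAsync += async args =>
        {
            if (!args.HasEvent) return;

            var bucket = Buckets.GetOrAdd(args.Partition.PartitionId, _ => new List<EventData>());
            bucket.Add(args.Data);

            if (bucket.Count >= FlushThreshold)
            {
                await FlushToDownstreamAsync(bucket);   // user-defined batch handling
                bucket.Clear();

                // Checkpoint only after the bucket has been flushed; a crash before the
                // flush means the un-flushed events are replayed rather than lost.
                await args.UpdateCheckpointAsync(args.CancellationToken);
            }
        };

        processor.ProcessErrorAsync += args =>
        {
            Console.Error.WriteLine($"Partition '{args.PartitionId}': {args.Exception}");
            return Task.CompletedTask;
        };

        await processor.StartProcessingAsync();
        await Task.Delay(Timeout.Infinite);
    }

    // Placeholder for the downstream call that consumes an aggregated batch.
    static Task FlushToDownstreamAsync(IReadOnlyList<EventData> events) => Task.CompletedTask;
}
```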
The current SDK has a `MaxBatchSize` property, which puts an upper limit on the number of messages in a batch, but it isn’t related to the actual number of unprocessed messages in a given partition. Even if I have thousands of events waiting to be processed and my `MaxBatchSize` is set to 10k, I can still receive batches of 5-7 messages.
I am proposing a `MinBatchSize` property that tells the WebJobs SDK to aggregate data in in-memory buckets. If set, the internal processor will fill the relevant bucket instead of triggering, and will not update the checkpoint. When a bucket reaches `MinBatchSize`, or when the configured time span since the last event has elapsed, the processor will trigger the user code and afterwards update the checkpoint for the given partition.

As far as I understand, the “invoke after enough time has passed since the last event” behavior was removed in a recent PR (#19140), but it might be needed here.
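To make the proposed semantics concrete, here is a purely illustrative sketch of the buffering behavior described above, independent of the real extension internals. `MinBatchSize`, `MaxWaitTime`, and the two delegates are hypothetical names used only for this proposal, not existing extension APIs:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Azure.Messaging.EventHubs;

// One instance per partition; illustrates the proposed "fill, then trigger and checkpoint" flow.
public sealed class PartitionBucket
{
    private readonly List<EventData> _events = new();
    private DateTimeOffset _lastEventUtc = DateTimeOffset.UtcNow;

    public int MinBatchSize { get; init; } = 10_000;                        // proposed option
    public TimeSpan MaxWaitTime { get; init; } = TimeSpan.FromSeconds(30);  // proposed option

    // Called by the (hypothetical) internal processor for every event it receives.
    public async Task AddAsync(
        EventData @event,
        Func<IReadOnlyList<EventData>, Task> triggerFunctionAsync,   // runs the user's code
        Func<Task> checkpointAsync)                                  // checkpoints this partition
    {
        _events.Add(@event);
        _lastEventUtc = DateTimeOffset.UtcNow;

        if (_events.Count >= MinBatchSize)
        {
            await FlushAsync(triggerFunctionAsync, checkpointAsync);
        }
    }

    // Called by a (hypothetical) timer so partially filled buckets still drain eventually.
    public async Task FlushIfIdleAsync(
        Func<IReadOnlyList<EventData>, Task> triggerFunctionAsync,
        Func<Task> checkpointAsync)
    {
        if (_events.Count > 0 && DateTimeOffset.UtcNow - _lastEventUtc >= MaxWaitTime)
        {
            await FlushAsync(triggerFunctionAsync, checkpointAsync);
        }
    }

    private async Task FlushAsync(
        Func<IReadOnlyList<EventData>, Task> triggerFunctionAsync,
        Func<Task> checkpointAsync)
    {
        await triggerFunctionAsync(_events);   // invoke user code with the full bucket
        await checkpointAsync();               // only checkpoint after the bucket was handled
        _events.Clear();
    }
}
```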
Top GitHub Comments
This is desperately needed in my opinion. We are deploying Function Apps and Event Hubs into all our subscriptions and in multiple regions for a logging solution. This works out to be thousands of Function Apps and Event Hubs. As we are migrating away from our current solution to Function Apps, I am seeing a concerning price increase for the Storage Accounts associated with the Function Apps (where checkpoints are stored) for subscriptions that are producing a large volume of logs to the Event Hubs.

The volume of logs that we receive into the Event Hubs can be anywhere from tens of messages per minute to millions of messages per minute. One subscription we have is producing logs into the Event Hub at a steady rate of around 100k messages/minute. In our `host.json` file, we are specifying `maxBatchSize` as `1000`, but based on the logs in App Insights the vast majority (realistically almost all) of executions are only receiving a batch size of one. I am aware that `maxBatchSize` is not a guarantee of how many messages you will receive in a batch. This is causing us to have 30k+ function executions/minute. That alone is pricey (even though the executions are < 1 second). It has also caused an excessive amount of transactions against the Storage Account - I believe it was something like 90k transactions/minute. I modified the `batchCheckpointFrequency` field from `1` to `5` and that helped cut the transactions to < 30k/minute, but I still find that excessive. We are required to have advanced threat protection turned on for all Storage Accounts, and I believe I worked that out to be $40+ a day, which is ridiculous. Looking at the price breakdown for the Storage Account, advanced threat protection and write operations were the most costly. This one Storage Account has now become the majority of the cost for our resource group.

I would like to see a configuration available in `host.json` that would allow me to configure a minimum batch size (I am guaranteed to receive at least that many messages per execution) and a maximum batching window (if the minimum batch size is not met within some configurable amount of time, give me what you have so far). This is what AWS Lambda has with Kinesis and it is really useful. Link to some Terraform configuration that controls this here and here. This would hopefully cut down on the number of executions that my function has and ultimately reduce the number of transactions against the Storage Account. It would also help our code be more performant, since the downstream service we are sending logs to can be a bottleneck. To get around this we batch up logs into one request before pushing, but we aren’t getting that benefit if each function invocation is only dealing with one message at a time. As long as configurations are properly documented, I am happy to work through adjusting the values to meet my function’s needs.

Also, I would appreciate any advice to help cut down on our Storage Account transactions until this is implemented.
@JoshLove-msft: No, it hasn’t - it’s not something that we’ve seen feedback requesting.

For the majority case, `EventProcessorClient` is the processor type in use, which is single-dispatch for delivering events to handlers. For the `EventProcessor<T>` that underpins the Functions extensions, the focus has been to maximize throughput by dispatching as quickly as possible once any events are available in the prefetch queue.

I think the Functions scenario is somewhat unique in that there’s a cost associated with invocations. I’d be inclined to say that we should consider building this into the `EventProcessorHost` in the Functions bindings to start, and reassess moving it to the base class if we see more general demand for it.