EventHub integration offset value errors
Repro steps
Provide the steps required to reproduce the problem
- Create an EventHub trigger integration.
- Pause or delete the integration for a period that exceeds the retention of the EventHub.
- Resume the integration / unpause the trigger.
Expected behavior
Upon resumption of the trigger, the stored offsets will be invalid. The EventHub trigger should compensate for this and be able to reset the offset.
In addition, upon the deletion of an input trigger, the corresponding blob data for offset checkpointing should be deleted from the storage account.
Actual behavior
Any partitions with invalid offsets will constantly produce errors from the AMQP consumer. The trigger never fixes these offsets, and the error is not visible in the functions app logs or similar places. It is only visible in Application Insights, which means Microsoft support cannot identify the problem either.
e.g.
System.ArgumentException: The supplied offset '55838201792' is invalid. The last offset in the system is '30089580512' TrackingId:<redacted>_B14, SystemTracker:<redacted>:eventhub:<redacted>~12287, Timestamp:2019-01-18T04:56:57 Reference:<redacted>, TrackingId:<redacted>_B14, SystemTracker:<redacted>:eventhub:<redacted>~12287|$default, Timestamp:2019-01-18T04:56:57 TrackingId:<redacted>_G6, SystemTracker:gateway5, Timestamp:2019-01-18T04:56:57
at Microsoft.Azure.EventHubs.Amqp.AmqpPartitionReceiver.OnReceiveAsync(Int32 maxMessageCount, TimeSpan waitTime)
at Microsoft.Azure.EventHubs.Amqp.AmqpPartitionReceiver.OnReceiveAsync(Int32 maxMessageCount, TimeSpan waitTime)
at Microsoft.Azure.EventHubs.PartitionReceiver.ReceiveAsync(Int32 maxMessageCount, TimeSpan waitTime)
at Microsoft.Azure.EventHubs.Amqp.AmqpPartitionReceiver.ReceivePumpAsync(CancellationToken cancellationToken, Boolean invokeWhenNoEvents)
System.OperationCanceledException: The AMQP object session36857 is aborted.
at Microsoft.Azure.Amqp.AsyncResult.End[TAsyncResult](IAsyncResult result)
at Microsoft.Azure.Amqp.AmqpObject.OpenAsyncResult.End(IAsyncResult result)
at Microsoft.Azure.Amqp.AmqpObject.EndOpen(IAsyncResult result)
at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
--- End of stack trace from previous location where exception was thrown ---
at Microsoft.Azure.EventHubs.Amqp.AmqpPartitionReceiver.CreateLinkAsync(TimeSpan timeout)
at Microsoft.Azure.Amqp.FaultTolerantAmqpObject`1.OnCreateAsync(TimeSpan timeout)
at Microsoft.Azure.Amqp.Singleton`1.CreateValue(TaskCompletionSource`1 tcs, TimeSpan timeout)
at Microsoft.Azure.Amqp.Singleton`1.GetOrCreateAsync(TimeSpan timeout)
at Microsoft.Azure.EventHubs.Amqp.AmqpPartitionReceiver.OnReceiveAsync(Int32 maxMessageCount, TimeSpan waitTime)
at Microsoft.Azure.EventHubs.Amqp.AmqpPartitionReceiver.OnReceiveAsync(Int32 maxMessageCount, TimeSpan waitTime)
at Microsoft.Azure.EventHubs.PartitionReceiver.ReceiveAsync(Int32 maxMessageCount, TimeSpan waitTime)
at Microsoft.Azure.EventHubs.Amqp.AmqpPartitionReceiver.ReceivePumpAsync(CancellationToken cancellationToken, Boolean invokeWhenNoEvents)
Known workarounds
We believe that deleting the blobs with the bad offsets will resolve the problem by causing the blob to be recreated.
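A minimal Python sketch of this workaround follows. It rests on assumptions: `container` is assumed to behave like a `ContainerClient` from the `azure-storage-blob` package, the first segment of the checkpoint path described under "Additional information" (`azure-webjobs-eventhub`) is assumed to be the container name, and `delete_checkpoint_blobs` and `InMemoryContainer` are hypothetical names used only for illustration. The function defaults to a dry run so the matching blobs can be reviewed before anything is removed.

```python
from types import SimpleNamespace


def delete_checkpoint_blobs(container, prefix, dry_run=True):
    """List (and optionally delete) the checkpoint blobs under `prefix`.

    `container` is assumed to behave like azure.storage.blob.ContainerClient;
    only list_blobs(name_starts_with=...) and delete_blob(name) are used.
    Returns the names of the blobs that matched the prefix.
    """
    names = [blob.name for blob in container.list_blobs(name_starts_with=prefix)]
    if not dry_run:
        for name in names:
            container.delete_blob(name)
    return names


class InMemoryContainer:
    """Hypothetical stand-in for a real container, for trying the function offline."""

    def __init__(self, names):
        self.names = set(names)

    def list_blobs(self, name_starts_with=None):
        prefix = name_starts_with or ""
        return [
            SimpleNamespace(name=n)
            for n in sorted(self.names)
            if n.startswith(prefix)
        ]

    def delete_blob(self, name):
        self.names.remove(name)
```

It is probably safest to stop the function before deleting anything. The checkpoint blobs may carry an active lease held by the running host, in which case the delete call could be rejected.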
Additional information
The Azure EventHub and Functions integration should do two things:
- Upon detecting an offset error, it needs to decide what to do: reset the offset checkpoint and then either recapture from the earliest data in that partition (probably the safest option) or capture from the latest data. There might be value in making this user-configurable.
- When an EventHub trigger is deleted, the corresponding offset data should be deleted from the storage account.
Bonus: it would be nice if the user could see these errors in the logs of the functions app, but they do not appear there.
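To make the first request concrete, here is a small Python sketch of the decision it describes. It is not how the EventHub trigger works today. The names `ResetPolicy`, `parse_offset_error` and `choose_start_position` are hypothetical, and the regular expression only matches the exact message wording quoted in this issue.

```python
import re
from enum import Enum
from typing import Optional, Tuple

# Matches the service message quoted in this issue; other wordings are not handled.
_OFFSET_ERROR = re.compile(
    r"The supplied offset '(\d+)' is invalid\. "
    r"The last offset in the system is '(\d+)'"
)


class ResetPolicy(Enum):
    EARLIEST = "earliest"  # replay everything still retained in the partition
    LATEST = "latest"      # skip ahead and only read new events


def parse_offset_error(message: str) -> Optional[Tuple[int, int]]:
    """Return (supplied_offset, last_offset) if the message is an offset error."""
    match = _OFFSET_ERROR.search(message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def choose_start_position(
    message: str, policy: ResetPolicy = ResetPolicy.EARLIEST
) -> Optional[str]:
    """Pick where to restart after an invalid-offset error.

    Returns None when the message is not an offset error, so that unrelated
    failures do not trigger a checkpoint reset.
    """
    if parse_offset_error(message) is None:
        return None
    return policy.value
```

The default is `EARLIEST` because the issue calls replaying from the earliest data the safest choice. A user-configurable setting would simply pass a different `ResetPolicy`.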
For the details:
The EventHub integration keeps offset data in a path in the storage account at:
azure-webjobs-eventhub/<namespace>.servicebus.windows.net/<eventhub name>/<consumer group>/
In here, there is a file for each partition. The contents of each file are structured as shown below:
{"Offset":"<offset count>","SequenceNumber":<number>,"PartitionId":"0","Owner":"<uuid>","Token":"<uuid>","Epoch":<number>}
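As a rough illustration, a checkpoint file in this shape can be read and compared against the last offset reported in the error message. This Python sketch assumes that `Offset` always holds a decimal integer string, as it does in the error above. `Checkpoint`, `parse_checkpoint` and `is_beyond_last_offset` are hypothetical names, not part of any Azure SDK.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    offset: int
    sequence_number: int
    partition_id: str
    owner: str
    epoch: int


def parse_checkpoint(blob_text: str) -> Checkpoint:
    """Parse the JSON body of one per-partition checkpoint file."""
    doc = json.loads(blob_text)
    return Checkpoint(
        offset=int(doc["Offset"]),  # assumption: stored as a numeric string
        sequence_number=int(doc["SequenceNumber"]),
        partition_id=doc["PartitionId"],
        owner=doc["Owner"],
        epoch=int(doc["Epoch"]),
    )


def is_beyond_last_offset(checkpoint: Checkpoint, last_offset: int) -> bool:
    """True when the stored offset is past the last offset the service reports."""
    return checkpoint.offset > last_offset
```

For example, a file containing `"Offset":"55838201792"` would be flagged when checked against the last offset `30089580512` from the exception above. This only covers the case shown in that error message. A checkpoint that has fallen behind the retained data would need a comparison against the earliest available offset instead.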
Issue Analytics
- Created 5 years ago
- Reactions:10
- Comments:33 (2 by maintainers)
Top GitHub Comments
Noise? No. This prevents messages from being ingested and processed by the function from any of the affected partitions until the offsets are fixed.
Hi @jeffhollan - my original problem was not due to deleting the EventHub. It was because the EventHub consumer was paused longer than the retention period. I just want to make clear that deleting storage, etc. were attempts to fix the problem, not the cause. That said, messing with storage, etc. can land you in the same state.
I think EventHub just needs to detect when the offset is invalid and clean up the storage.