EventHub integration offset value errors

Repro steps

1. Create an EventHub trigger integration.
2. Pause or delete the integration for a period that exceeds the retention of the EventHub.
3. Resume the integration / unpause the trigger.

Expected behavior

Upon resumption of the trigger, the stored offsets will be invalid. The EventHub trigger should compensate for this and be able to reset the offset.

In addition, upon the deletion of an input trigger, the corresponding blob data for offset checkpointing should be deleted from the storage account.

Actual behavior

Any partition with an invalid offset constantly produces errors from the AMQP consumer. The trigger never fixes these offsets, and the error does not appear in the functions app logs. It is visible only in Application Insights, so Microsoft support cannot identify the problem either.

e.g.

System.ArgumentException: The supplied offset '55838201792' is invalid. The last offset in the system is '30089580512' TrackingId:<redacted>_B14, SystemTracker:<redacted>:eventhub:<redacted>~12287, Timestamp:2019-01-18T04:56:57 Reference:<redacted>, TrackingId:<redacted>_B14, SystemTracker:<redacted>:eventhub:<redacted>~12287|$default, Timestamp:2019-01-18T04:56:57 TrackingId:<redacted>_G6, SystemTracker:gateway5, Timestamp:2019-01-18T04:56:57
   at Microsoft.Azure.EventHubs.Amqp.AmqpPartitionReceiver.OnReceiveAsync(Int32 maxMessageCount, TimeSpan waitTime)
   at Microsoft.Azure.EventHubs.Amqp.AmqpPartitionReceiver.OnReceiveAsync(Int32 maxMessageCount, TimeSpan waitTime)
   at Microsoft.Azure.EventHubs.PartitionReceiver.ReceiveAsync(Int32 maxMessageCount, TimeSpan waitTime)
   at Microsoft.Azure.EventHubs.Amqp.AmqpPartitionReceiver.ReceivePumpAsync(CancellationToken cancellationToken, Boolean invokeWhenNoEvents)



System.OperationCanceledException: The AMQP object session36857 is aborted.
   at Microsoft.Azure.Amqp.AsyncResult.End[TAsyncResult](IAsyncResult result)
   at Microsoft.Azure.Amqp.AmqpObject.OpenAsyncResult.End(IAsyncResult result)
   at Microsoft.Azure.Amqp.AmqpObject.EndOpen(IAsyncResult result)
   at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
--- End of stack trace from previous location where exception was thrown ---
   at Microsoft.Azure.EventHubs.Amqp.AmqpPartitionReceiver.CreateLinkAsync(TimeSpan timeout)
   at Microsoft.Azure.Amqp.FaultTolerantAmqpObject`1.OnCreateAsync(TimeSpan timeout)
   at Microsoft.Azure.Amqp.Singleton`1.CreateValue(TaskCompletionSource`1 tcs, TimeSpan timeout)
   at Microsoft.Azure.Amqp.Singleton`1.GetOrCreateAsync(TimeSpan timeout)
   at Microsoft.Azure.EventHubs.Amqp.AmqpPartitionReceiver.OnReceiveAsync(Int32 maxMessageCount, TimeSpan waitTime)
   at Microsoft.Azure.EventHubs.Amqp.AmqpPartitionReceiver.OnReceiveAsync(Int32 maxMessageCount, TimeSpan waitTime)
   at Microsoft.Azure.EventHubs.PartitionReceiver.ReceiveAsync(Int32 maxMessageCount, TimeSpan waitTime)
   at Microsoft.Azure.EventHubs.Amqp.AmqpPartitionReceiver.ReceivePumpAsync(CancellationToken cancellationToken, Boolean invokeWhenNoEvents)
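
Since these errors only surface in Application Insights, it can help to pull the two offsets out of the exception text when triaging exported logs. The following is a minimal sketch, assuming the message wording matches the ArgumentException quoted above; the function name parse_offset_error is hypothetical, and the wording may differ between SDK versions.

```python
import re

# Pattern based on the ArgumentException text quoted in this issue.
# Assumption: other SDK versions may word the message differently.
_OFFSET_ERROR = re.compile(
    r"The supplied offset '(?P<supplied>\d+)' is invalid\. "
    r"The last offset in the system is '(?P<last>\d+)'"
)


def parse_offset_error(message):
    """Return (supplied_offset, last_offset), or None if the text does not match."""
    match = _OFFSET_ERROR.search(message)
    if match is None:
        return None
    return int(match.group("supplied")), int(match.group("last"))
```

A supplied offset larger than the last offset in the system, as in the trace above, indicates the stored checkpoint no longer corresponds to data in the partition.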

Known workarounds

We believe that deleting the blobs with the bad offsets will resolve the problem by causing those blobs to be recreated.
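
To decide which blobs to delete, one option is to compare each stored offset with the last offset the service reports for that partition. The sketch below assumes the checkpoint contents have already been downloaded as JSON text and that "bad" means the stored offset is greater than the last offset, as in the error above. The function name and both input mappings are hypothetical; the actual deletion would be done separately with a storage tool or SDK.

```python
import json


def find_stale_checkpoints(checkpoints, last_offsets):
    """Return partition IDs whose stored offset is past the last offset.

    checkpoints:  partition ID -> raw JSON text of the checkpoint blob
    last_offsets: partition ID -> last offset reported by the service
    """
    stale = []
    for partition_id, raw in checkpoints.items():
        last = last_offsets.get(partition_id)
        if last is None:
            continue  # nothing known about this partition; leave it alone
        offset_text = json.loads(raw).get("Offset")
        if not offset_text:
            continue  # no offset stored yet, so nothing to invalidate
        if int(offset_text) > last:
            stale.append(partition_id)
    return sorted(stale)
```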

Additional information

The Azure EventHub and Functions integration should do two things:

1. Upon detecting an offset error, it needs to decide how to recover. That means resetting the offset checkpoint and then either recapturing from the earliest data in that partition (probably the safest option) or capturing from the latest data. There might be value in making this user-configurable.
2. When an EventHub trigger is deleted, the corresponding offset data should be deleted from the storage account.
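
The recovery decision described above could look roughly like the following. This is only an illustration of the proposed behavior, not an existing Event Hubs or Functions API: ResetPolicy, resolve_start_position, and the first_offset/last_offset inputs are all hypothetical names.

```python
from enum import Enum


class ResetPolicy(Enum):
    """Hypothetical user-configurable choice for recovering from a bad offset."""
    EARLIEST = "earliest"
    LATEST = "latest"


def resolve_start_position(stored_offset, first_offset, last_offset,
                           policy=ResetPolicy.EARLIEST):
    """Pick the offset a partition reader should start from.

    The stored offset is kept when it still falls inside the range the
    partition currently holds; otherwise the policy decides the fallback.
    """
    if first_offset <= stored_offset <= last_offset:
        return stored_offset
    if policy is ResetPolicy.EARLIEST:
        return first_offset
    return last_offset
```

Defaulting to the earliest position matches the "safest" option suggested above, since it re-reads retained data rather than skipping it.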

Bonus: it would be nice if the user could see these errors in the logs of the functions app, but they do not appear there.

For the details:

The EventHub integration keeps offset data in a path in the storage account at: azure-webjobs-eventhub/<namespace>.servicebus.windows.net/<eventhub name>/<consumer group>/

In here, there is a file for each partition. The contents of each file are structured as shown below:

{"Offset":"<offset count>","SequenceNumber":<number>,"PartitionId":"0","Owner":"<uuid>","Token":"<uuid>","Epoch":<number>}
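
For anyone inspecting these files by hand, here is a small sketch that builds the storage path prefix described above and parses one checkpoint file into a typed record. The helper names and the Checkpoint class are hypothetical; only the path layout and the JSON field names come from this issue.

```python
import json
from dataclasses import dataclass

CONTAINER = "azure-webjobs-eventhub"  # path root as described in this issue


def checkpoint_prefix(namespace, eventhub, consumer_group):
    """Build the path under which the per-partition checkpoint files live."""
    return (f"{CONTAINER}/{namespace}.servicebus.windows.net/"
            f"{eventhub}/{consumer_group}/")


@dataclass
class Checkpoint:
    offset: str
    sequence_number: int
    partition_id: str
    owner: str
    token: str
    epoch: int


def parse_checkpoint(raw):
    """Parse the JSON text of one checkpoint file."""
    data = json.loads(raw)
    return Checkpoint(
        offset=data["Offset"],
        sequence_number=data["SequenceNumber"],
        partition_id=data["PartitionId"],
        owner=data["Owner"],
        token=data["Token"],
        epoch=data["Epoch"],
    )
```

The offset is kept as a string because that is how it appears in the file; convert it with int() before comparing it with other offsets.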

Issue Analytics

Created 5 years ago
Reactions: 10
Comments: 33 (2 by maintainers)

Top GitHub Comments

8 reactions

mbrancato commented, Jan 23, 2019

Noise? No. This prevents messages from being ingested and processed by the function from any of the affected partitions until the offsets are fixed.

4 reactions

mbrancato commented, Mar 4, 2020

Hi @jeffhollan - my original problem was not due to deleting the EventHub. It was because the EventHub consumer was paused longer than the retention period. I just want to make clear that deleting storage, etc. were attempts to fix the problem, not the cause. That said, messing with storage, etc. can land you in the same state.

I think EventHub just needs to detect when the offset is invalid and clean up the storage.