[BUG] Service Bus .NET SDK randomly fails to complete messages for long running process
Describe the bug
When a Service Bus topic message takes a long time to process (in my case between 15 and 20 minutes) and the respective MessageHandlerOptions is initialized as:
new MessageHandlerOptions(HandleException)
{
    MaxAutoRenewDuration = TimeSpan.FromMinutes(30),
    AutoComplete = false
}
attempting to complete the message once processing has finished fails with an error, as if the message lock had expired or the message had already been removed. Processing always finishes sooner than the TimeSpan set for MaxAutoRenewDuration (in my case processing took 18 minutes and MaxAutoRenewDuration was set to 30 minutes). Only a single process is working with the Service Bus topic at any given time.
Expected behavior
My understanding is that, based on the value of MaxAutoRenewDuration, the message should stay locked and its lock should be renewed automatically for the duration defined. If processing finishes before this duration expires, the lock should still be valid and completing the message should succeed.
Actual behavior (include Exception or Stack Trace)
When trying to complete the message after the long-running process, the following exception was thrown:
Microsoft.Azure.ServiceBus.MessageLockLostException: The lock supplied is invalid. Either the lock expired, or the message has already been removed from the queue, or was received by a different receiver instance.
at Microsoft.Azure.ServiceBus.Core.MessageReceiver.DisposeMessagesAsync(IEnumerable`1 lockTokens, Outcome outcome)
at Microsoft.Azure.ServiceBus.RetryPolicy.RunOperation(Func`1 operation, TimeSpan operationTimeout)
at Microsoft.Azure.ServiceBus.RetryPolicy.RunOperation(Func`1 operation, TimeSpan operationTimeout)
at Microsoft.Azure.ServiceBus.Core.MessageReceiver.CompleteAsync(IEnumerable`1 lockTokens)
at ServiceBusHandler.Program.HandleLongRunningMessage(Message message, CancellationToken cancellationToken) in C:\[PROGRAM_LOCATION]\Program.cs:line 46
at Microsoft.Azure.ServiceBus.MessageReceivePump.MessageDispatchTask(Message message)
Sometimes, in the middle of processing, I got the following slightly different exception (I would guess while the lock was being renewed):
Microsoft.Azure.ServiceBus.MessageLockLostException: The lock supplied is invalid. Either the lock expired, or the message has already been removed from the queue. Reference:38e353cf-423a-4421-bdc0-207ca76c7244, TrackingId:1474382e-1a29-4800-b599-c176d1559804_B1, SystemTracker:gsis-einvoice:Topic:longtopic|longsub, Timestamp:2020-04-23T06:26:07
at Microsoft.Azure.ServiceBus.Core.MessageReceiver.OnRenewLockAsync(String lockToken)
at Microsoft.Azure.ServiceBus.Core.MessageReceiver.<>c__DisplayClass74_0.<<RenewLockAsync>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at Microsoft.Azure.ServiceBus.RetryPolicy.RunOperation(Func`1 operation, TimeSpan operationTimeout)
at Microsoft.Azure.ServiceBus.RetryPolicy.RunOperation(Func`1 operation, TimeSpan operationTimeout)
at Microsoft.Azure.ServiceBus.Core.MessageReceiver.RenewLockAsync(String lockToken)
at Microsoft.Azure.ServiceBus.Core.MessageReceiver.RenewLockAsync(Message message)
at Microsoft.Azure.ServiceBus.MessageReceivePump.RenewMessageLockTask(Message message, CancellationToken renewLockCancellationToken)
To Reproduce
The following gist contains a .NET Core 3.1 console program with which I was able to reproduce the problem a couple of times when running it for 10 messages:
https://gist.github.com/nianton/a1b64094d79c0da4f15037ce301d1b23
Environment:
- Azure Service Bus resource
  - Location: West Europe
  - Tier: Standard
  - Message time to live: 14 days
  - Message lock duration: 5 mins
- Microsoft.Azure.ServiceBus v4.1.3
- Windows 10, .NET Core v3.1.3
- Visual Studio 16.5.2
Top GitHub Comments
@nianton Locks in Service Bus are volatile. The longer a message takes to process, the more susceptible it is to events that may cause the lock to be lost.
Some of these events are client restarts, service restarts, connection breaks, etc. In these cases the message is locked to the specific connection, so when you reconnect and try to renew the lock or complete the message, the lock is no longer held by the client. The SDK's retry logic will ensure you can reconnect, but realistically you may still see random lock-lost exceptions.
Hope this helps.
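Since lock loss cannot be ruled out for long-running handlers, one way to live with it is to treat MessageLockLostException on completion as an expected outcome rather than a fatal error. The following is a minimal sketch under that assumption, not a fix from the SDK team: the class name LockLostTolerantHandler and the method DoLongRunningWorkAsync are hypothetical placeholders, and the sketch assumes the work is idempotent, because a message whose lock was lost becomes available for delivery again.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.ServiceBus;

// Hypothetical wrapper around a subscription client (Microsoft.Azure.ServiceBus v4.x).
public class LockLostTolerantHandler
{
    private readonly ISubscriptionClient _client;

    public LockLostTolerantHandler(ISubscriptionClient client) => _client = client;

    public void Register()
    {
        // Same options as in the issue description.
        var options = new MessageHandlerOptions(HandleException)
        {
            MaxAutoRenewDuration = TimeSpan.FromMinutes(30),
            AutoComplete = false
        };
        _client.RegisterMessageHandler(HandleMessageAsync, options);
    }

    private async Task HandleMessageAsync(Message message, CancellationToken cancellationToken)
    {
        // Assumption: this work is idempotent, so processing the same
        // message a second time after a lost lock does no harm.
        await DoLongRunningWorkAsync(message, cancellationToken);

        try
        {
            await _client.CompleteAsync(message.SystemProperties.LockToken);
        }
        catch (MessageLockLostException)
        {
            // The lock is gone (for example after a connection break), so the
            // message cannot be completed and will be delivered again.
            Console.WriteLine(
                $"Lock lost for message {message.MessageId} " +
                $"(delivery count {message.SystemProperties.DeliveryCount}); expecting redelivery.");
        }
    }

    // Hypothetical placeholder for the 15 to 20 minute job.
    private static Task DoLongRunningWorkAsync(Message message, CancellationToken cancellationToken)
        => Task.Delay(TimeSpan.FromMinutes(18), cancellationToken);

    private static Task HandleException(ExceptionReceivedEventArgs args)
    {
        // Errors raised by the message pump itself, such as a failed lock renewal, arrive here.
        Console.WriteLine($"Message pump error: {args.Exception}");
        return Task.CompletedTask;
    }
}
```

Catching the exception does not recover the lock; it only keeps the handler from failing after the work has already been done. Whether reprocessing is acceptable depends on the workload.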
Hey @sikemullivan
The reply from the Service Bus team in https://github.com/Azure/azure-sdk-for-net/issues/11533#issuecomment-716899419 explains the potential reasons for losing locks. This is by design and there are no fixes that can be made to the SDK to work around this. Therefore, we closed the issue.
If the connection breaks in between these 2 hours, there is no way supported by the service today to regain the locks.
This would be a feature request for the service, which can be logged at https://github.com/Azure/azure-service-bus
Also, a similar discussion is happening over at #8291. Please consider subscribing there and adding your input.