[BUG] Message lost when attempting to send large batch of messages within a transaction
Library name and version
Azure.Messaging.ServiceBus 7.15.0, 7.11.1, and others
Describe the bug
We are working on the NServiceBus Azure Service Bus transport. We have encountered a case where users attempt to send more than 100 messages in a single transaction. We realize this exceeds the service quota, but we rely on the service to surface that error so that we can catch it and engage our error handling logic.
Our error handler opens a new transaction, sends a copy of the incoming message to the error queue, and acknowledges the incoming message. This usually works; however, we have seen scenarios where it fails and the message does not reach the error queue. It gets stuck in the transfer queue. Sometimes it is eventually delivered to the error queue, but sometimes it is completely lost. We have observed it being stuck for hours.
When the message ends up in the transfer queue, we can observe a TransactionDischargeException event on the first transaction (the one sending more than 100 messages). The second transaction seems to complete successfully, but the message does not move to the error queue as expected. Instead it remains in the input queue, not as an active message but in transfer.
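To make the flow concrete, here is a minimal sketch of the pattern described above (this is not the actual NServiceBus transport code; the queue names, message count, top-level-program shape, and the EnableCrossEntityTransactions option are assumptions based on our repro setup):

```csharp
using System.Transactions;
using Azure.Messaging.ServiceBus;

var connectionString = Environment.GetEnvironmentVariable("AzureServiceBus_ConnectionString");

await using var client = new ServiceBusClient(connectionString, new ServiceBusClientOptions
{
    // Cross-entity transactions route the sends through the input queue's
    // transfer queue, which is where the message ends up stuck.
    EnableCrossEntityTransactions = true
});

ServiceBusReceiver receiver = client.CreateReceiver("repro-queue");
ServiceBusSender destinationSender = client.CreateSender("repro-destination");
ServiceBusSender errorSender = client.CreateSender("repro-error");

ServiceBusReceivedMessage incoming = await receiver.ReceiveMessageAsync();

try
{
    // First transaction: deliberately exceed the 100-messages-per-transaction quota.
    using var tx = new TransactionScope(TransactionScopeAsyncFlowOption.Enabled);
    for (var i = 0; i < 150; i++)
    {
        await destinationSender.SendMessageAsync(new ServiceBusMessage($"message {i}"));
    }
    await receiver.CompleteMessageAsync(incoming);
    tx.Complete();
}
catch (ServiceBusException)
{
    // Error handling: a second transaction forwards a copy of the incoming
    // message to the error queue and completes the original.
    // (Only the body is copied here; the real handler copies headers as well.)
    using var errorTx = new TransactionScope(TransactionScopeAsyncFlowOption.Enabled);
    await errorSender.SendMessageAsync(new ServiceBusMessage(incoming.Body));
    await receiver.CompleteMessageAsync(incoming);
    errorTx.Complete();
}
```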
We have built a simple reproduction sample that surfaces this behavior here.
Expected behavior
The message should be copied into the error queue, and removed from the input queue. It should not be stuck in transfer.
Actual behavior
When we start the first transaction, which tries to send a lot of messages:
07:26:36:258 EVENT: TransactionDeclared
AmqpTransactionDeclared for LocalTransactionId: bd9ec3f2-fbaf-4d47-8ad1-03f0e83a34e9:1 AmqpTransactionId: txn:ab7aae38a8a941ec94a7d4ed9096e54a__G9:760805_G9.
transactionId: bd9ec3f2-fbaf-4d47-8ad1-03f0e83a34e9:1
amqpTransactionId: txn:ab7aae38a8a941ec94a7d4ed9096e54a__G9:760805_G9
When the first transaction fails:
07:27:50:718 EVENT: TransactionDischargeException
AmqpTransactionDischargeException for TransactionId: bd9ec3f2-fbaf-4d47-8ad1-03f0e83a34e9:1 AmqpTransactionId: txn:ab7aae38a8a941ec94a7d4ed9096e54a__G9:760805_G9 Exception: Azure.Messaging.ServiceBus.ServiceBusException: The operation did not complete within the allocated time 00:01:00 for object message. For more information on exception types and proper exception handling, please refer to http://go.microsoft.com/fwlink/?LinkId=761101 Reference:dd316a6c-a747-46d7-957c-734c28195f67, TrackingId:e783359f-305b-4c78-821c-f8673b0db994_G9, SystemTracker:gtm, Timestamp:2023-06-28T07:27:40 (ServiceTimeout). For troubleshooting information, see https://aka.ms/azsdk/net/servicebus/exceptions/troubleshoot..
transactionId: bd9ec3f2-fbaf-4d47-8ad1-03f0e83a34e9:1
amqpTransactionId: txn:ab7aae38a8a941ec94a7d4ed9096e54a__G9:760805_G9
exception: Azure.Messaging.ServiceBus.ServiceBusException: The operation did not complete within the allocated time 00:01:00 for object message. For more information on exception types and proper exception handling, please refer to http://go.microsoft.com/fwlink/?LinkId=761101 Reference:dd316a6c-a747-46d7-957c-734c28195f67, TrackingId:e783359f-305b-4c78-821c-f8673b0db994_G9, SystemTracker:gtm, Timestamp:2023-06-28T07:27:40 (ServiceTimeout). For troubleshooting information, see https://aka.ms/azsdk/net/servicebus/exceptions/troubleshoot.
When we start the second transaction:
07:27:53:978 EVENT: TransactionDeclared
AmqpTransactionDeclared for LocalTransactionId: bd9ec3f2-fbaf-4d47-8ad1-03f0e83a34e9:2 AmqpTransactionId: txn:ab7aae38a8a941ec94a7d4ed9096e54a__G9:760931_G9.
transactionId: bd9ec3f2-fbaf-4d47-8ad1-03f0e83a34e9:2
amqpTransactionId: txn:ab7aae38a8a941ec94a7d4ed9096e54a__G9:760931_G9
When the second transaction completes:
07:27:54:392 EVENT: TransactionDischarged
AmqpTransactionDischarged for LocalTransactionId: bd9ec3f2-fbaf-4d47-8ad1-03f0e83a34e9:2 AmqpTransactionId: txn:ab7aae38a8a941ec94a7d4ed9096e54a__G9:760931_G9 Rollback: False.
transactionId: bd9ec3f2-fbaf-4d47-8ad1-03f0e83a34e9:2
amqpTransactionId: txn:ab7aae38a8a941ec94a7d4ed9096e54a__G9:760931_G9
rollback: False
Final state:
#'0' messages in 'repro-queue'
#'1' transfer messages in 'repro-queue'
#'0' messages in 'repro-destination'
#'0' transfer messages in 'repro-destination'
#'0' messages in 'repro-error'
#'0' transfer messages in 'repro-error'
Reproduction Steps
- Download this reproduction code
- Create an environment variable named AzureServiceBus_ConnectionString with a valid Azure Service Bus connection string
- Run the sample
- Check the error queue repro-error to confirm that the message is not present
- Check the input queue repro-queue to confirm that the message is not active but is stuck in transfer (a sketch for checking these counts programmatically follows below)
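The queue states listed under "Final state" above can also be checked programmatically. A minimal sketch, assuming the same connection string environment variable and queue names (this is not part of the linked repro; ServiceBusAdministrationClient comes from the Azure.Messaging.ServiceBus.Administration namespace):

```csharp
using Azure.Messaging.ServiceBus.Administration;

var connectionString = Environment.GetEnvironmentVariable("AzureServiceBus_ConnectionString");
var adminClient = new ServiceBusAdministrationClient(connectionString);

foreach (var queue in new[] { "repro-queue", "repro-destination", "repro-error" })
{
    // ActiveMessageCount and TransferMessageCount correspond to the
    // "messages" and "transfer messages" counts shown in the final state above.
    QueueRuntimeProperties runtime = await adminClient.GetQueueRuntimePropertiesAsync(queue);
    Console.WriteLine($"#'{runtime.ActiveMessageCount}' messages in '{queue}'");
    Console.WriteLine($"#'{runtime.TransferMessageCount}' transfer messages in '{queue}'");
}
```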
Note that if we lower the message count to 105, we cannot reproduce this behavior. Even though we are still over the threshold, the message moves to the error queue as expected and is not stuck in transfer.
Environment
Output of dotnet --info
.NET SDK:
Version: 7.0.304
Commit: 7e794e2806
Runtime Environment:
OS Name: Windows
OS Version: 10.0.22621
OS Platform: Windows
RID: win10-x64
Base Path: C:\Program Files\dotnet\sdk\7.0.304\
Host:
Version: 7.0.7
Architecture: x64
Commit: 5b20af47d9
.NET SDKs installed:
2.1.512 [C:\Program Files\dotnet\sdk]
2.2.105 [C:\Program Files\dotnet\sdk]
6.0.101 [C:\Program Files\dotnet\sdk]
6.0.118 [C:\Program Files\dotnet\sdk]
7.0.304 [C:\Program Files\dotnet\sdk]
.NET runtimes installed:
Microsoft.AspNetCore.All 2.1.15 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All]
Microsoft.AspNetCore.All 2.1.16 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All]
Microsoft.AspNetCore.All 2.2.3 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All]
Microsoft.AspNetCore.App 2.1.15 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 2.1.16 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 2.2.3 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 3.1.30 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 3.1.32 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 6.0.1 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 6.0.10 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 6.0.18 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 7.0.7 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.NETCore.App 2.1.15 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 2.1.16 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 2.2.3 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 3.1.30 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 3.1.32 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 5.0.14 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 6.0.1 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 6.0.10 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 6.0.12 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 6.0.15 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 6.0.18 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 7.0.7 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.WindowsDesktop.App 3.1.32 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
Microsoft.WindowsDesktop.App 5.0.14 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
Microsoft.WindowsDesktop.App 6.0.1 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
Microsoft.WindowsDesktop.App 6.0.10 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
Microsoft.WindowsDesktop.App 6.0.15 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
Microsoft.WindowsDesktop.App 6.0.18 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
Microsoft.WindowsDesktop.App 7.0.7 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
Other architectures found:
x86 [C:\Program Files (x86)\dotnet]
registered at [HKLM\SOFTWARE\dotnet\Setup\InstalledVersions\x86\InstallLocation]
Project file
<TargetFramework>net6.0</TargetFramework>
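For reference, a minimal project file matching the environment above might look like the following sketch (the package version is taken from the versions listed at the top of this issue; the actual repro project may differ):

```xml
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>net6.0</TargetFramework>
    <ImplicitUsings>enable</ImplicitUsings>
  </PropertyGroup>
  <ItemGroup>
    <PackageReference Include="Azure.Messaging.ServiceBus" Version="7.15.0" />
  </ItemGroup>
</Project>
```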
Top GitHub Comments
Yeah, the new code does that already. It reduces the likelihood of the problem occurring, but the issue can still be observed, as @lailabougria highlighted in https://github.com/Azure/azure-sdk-for-net/issues/37265#issuecomment-1611342020
The best workaround we have put in place so far is a client-side check of the transaction limit, which we would have to adjust if the service limit changed, but that's an acceptable tradeoff for now to prevent message loss.
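A minimal sketch of such a client-side guard (not the actual NServiceBus code; the limit of 100, the wrapper type, and the counting approach are assumptions): it counts transactional sends per System.Transactions transaction and fails fast before the service-side quota is hit.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;
using System.Transactions;
using Azure.Messaging.ServiceBus;

// Hypothetical wrapper that throws client-side when more messages are sent
// within a single transaction than the assumed service-side limit of 100.
public class TransactionLimitGuard
{
    const int MaxMessagesPerTransaction = 100; // adjust if the service limit changes
    readonly ConcurrentDictionary<string, int> sendsPerTransaction = new();
    readonly ServiceBusSender sender;

    public TransactionLimitGuard(ServiceBusSender sender) => this.sender = sender;

    public async Task SendMessageAsync(ServiceBusMessage message, CancellationToken cancellationToken = default)
    {
        var transactionId = Transaction.Current?.TransactionInformation.LocalIdentifier;
        if (transactionId is not null)
        {
            var count = sendsPerTransaction.AddOrUpdate(transactionId, 1, (_, current) => current + 1);
            if (count > MaxMessagesPerTransaction)
            {
                // Failing here lets the normal error handling kick in without
                // ever putting the service-side transaction into the bad state.
                // (Cleanup of completed transaction entries is omitted in this sketch.)
                throw new InvalidOperationException(
                    $"Attempted to send more than {MaxMessagesPerTransaction} messages in a single transaction.");
            }
        }

        await sender.SendMessageAsync(message, cancellationToken);
    }
}
```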
Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @shankarsama @DorothySun216 @EldertGrootenboer @saglodha.