[BUG] Message lost when attempting to send large batch of messages within a transaction

Library name and version

Azure.Messaging.ServiceBus 7.15.0, 7.11.1, and others

Describe the bug

We are working on the NServiceBus Azure Service Bus transport. We have encountered a case where users attempt to send more than 100 messages in a single transaction. We realize this exceeds the quota, but we rely on the service to surface that error so that we can catch it and engage our error-handling logic.
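
A minimal sketch of that scenario against the plain SDK (not the actual transport code; the queue name and the count of 150 messages are illustrative, taken from the repro setup):

```csharp
using System;
using System.Transactions;
using Azure.Messaging.ServiceBus;

var connectionString = Environment.GetEnvironmentVariable("AzureServiceBus_ConnectionString");
await using var client = new ServiceBusClient(connectionString);
await using var sender = client.CreateSender("repro-queue");

try
{
    // All sends below share the same ambient transaction.
    using var scope = new TransactionScope(TransactionScopeAsyncFlowOption.Enabled);

    // 150 sends exceed the 100-operations-per-transaction quota; we rely on the
    // service to fault the transaction so the error handling described below can take over.
    for (var i = 0; i < 150; i++)
    {
        await sender.SendMessageAsync(new ServiceBusMessage($"message {i}"));
    }

    scope.Complete();
}
catch (ServiceBusException ex)
{
    Console.WriteLine($"Send failed: {ex.Reason} - {ex.Message}");
}
```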

Our error handler opens a new transaction, sends a copy of the incoming message to the error queue, and acknowledges the incoming message. This usually works; however, we have seen scenarios where it fails and the message does not go to the error queue. It gets stuck in the transfer queue. Sometimes it is eventually delivered to the error queue, but sometimes it is lost entirely. We have observed it staying stuck for hours.
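
The shape of that error handler, sketched against the plain Azure.Messaging.ServiceBus API (the actual NServiceBus code differs; the queue names are taken from the repro, and EnableCrossEntityTransactions plus the operation ordering inside the scope follow the SDK's cross-entity transaction guidance rather than the transport's internals):

```csharp
using System;
using System.Transactions;
using Azure.Messaging.ServiceBus;

// Cross-entity transactions are required so the settlement on "repro-queue" and
// the send to "repro-error" can be grouped in one transaction.
await using var client = new ServiceBusClient(
    Environment.GetEnvironmentVariable("AzureServiceBus_ConnectionString"),
    new ServiceBusClientOptions { EnableCrossEntityTransactions = true });

await using var receiver = client.CreateReceiver("repro-queue");
await using var errorSender = client.CreateSender("repro-error");

ServiceBusReceivedMessage incoming = await receiver.ReceiveMessageAsync();

using (var scope = new TransactionScope(TransactionScopeAsyncFlowOption.Enabled))
{
    // Settle the original and copy it to the error queue atomically. The send is
    // routed via the input queue's transfer queue, which is where the message
    // sometimes gets stuck in the failure scenario described above.
    await receiver.CompleteMessageAsync(incoming);
    await errorSender.SendMessageAsync(new ServiceBusMessage(incoming));
    scope.Complete();
}
```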

When the message ends up in the transfer queue, we can observe a TransactionDischargeException event for the first transaction (the one sending more than 100 messages). The second transaction appears to complete successfully, but the message does not move to the error queue as expected. Instead it remains in the input queue, not as an active message but in transfer.

We have built a simple reproduction sample that surfaces this behavior here.

Expected behavior

The message should be copied into the error queue, and removed from the input queue. It should not be stuck in transfer.

Actual behavior

When we start the first transaction, which tries to send a lot of messages:

07:26:36:258 EVENT: TransactionDeclared
AmqpTransactionDeclared for LocalTransactionId: bd9ec3f2-fbaf-4d47-8ad1-03f0e83a34e9:1 AmqpTransactionId: txn:ab7aae38a8a941ec94a7d4ed9096e54a__G9:760805_G9.
        transactionId: bd9ec3f2-fbaf-4d47-8ad1-03f0e83a34e9:1
        amqpTransactionId: txn:ab7aae38a8a941ec94a7d4ed9096e54a__G9:760805_G9

When the first transaction fails:

07:27:50:718 EVENT: TransactionDischargeException
AmqpTransactionDischargeException for TransactionId: bd9ec3f2-fbaf-4d47-8ad1-03f0e83a34e9:1 AmqpTransactionId: txn:ab7aae38a8a941ec94a7d4ed9096e54a__G9:760805_G9 Exception: Azure.Messaging.ServiceBus.ServiceBusException: The operation did not complete within the allocated time 00:01:00 for object message. For more information on exception types and proper exception handling, please refer to http://go.microsoft.com/fwlink/?LinkId=761101 Reference:dd316a6c-a747-46d7-957c-734c28195f67, TrackingId:e783359f-305b-4c78-821c-f8673b0db994_G9, SystemTracker:gtm, Timestamp:2023-06-28T07:27:40 (ServiceTimeout). For troubleshooting information, see https://aka.ms/azsdk/net/servicebus/exceptions/troubleshoot..
        transactionId: bd9ec3f2-fbaf-4d47-8ad1-03f0e83a34e9:1
        amqpTransactionId: txn:ab7aae38a8a941ec94a7d4ed9096e54a__G9:760805_G9
        exception: Azure.Messaging.ServiceBus.ServiceBusException: The operation did not complete within the allocated time 00:01:00 for object message. For more information on exception types and proper exception handling, please refer to http://go.microsoft.com/fwlink/?LinkId=761101 Reference:dd316a6c-a747-46d7-957c-734c28195f67, TrackingId:e783359f-305b-4c78-821c-f8673b0db994_G9, SystemTracker:gtm, Timestamp:2023-06-28T07:27:40 (ServiceTimeout). For troubleshooting information, see https://aka.ms/azsdk/net/servicebus/exceptions/troubleshoot.

When we start the second transaction:

07:27:53:978 EVENT: TransactionDeclared
AmqpTransactionDeclared for LocalTransactionId: bd9ec3f2-fbaf-4d47-8ad1-03f0e83a34e9:2 AmqpTransactionId: txn:ab7aae38a8a941ec94a7d4ed9096e54a__G9:760931_G9.
        transactionId: bd9ec3f2-fbaf-4d47-8ad1-03f0e83a34e9:2
        amqpTransactionId: txn:ab7aae38a8a941ec94a7d4ed9096e54a__G9:760931_G9

When the second transaction completes:

07:27:54:392 EVENT: TransactionDischarged
AmqpTransactionDischarged for LocalTransactionId: bd9ec3f2-fbaf-4d47-8ad1-03f0e83a34e9:2 AmqpTransactionId: txn:ab7aae38a8a941ec94a7d4ed9096e54a__G9:760931_G9 Rollback: False.
        transactionId: bd9ec3f2-fbaf-4d47-8ad1-03f0e83a34e9:2
        amqpTransactionId: txn:ab7aae38a8a941ec94a7d4ed9096e54a__G9:760931_G9
        rollback: False

Final state:

#'0' messages in 'repro-queue'
#'1' transfer messages in 'repro-queue'
#'0' messages in 'repro-destination'
#'0' transfer messages in 'repro-destination'
#'0' messages in 'repro-error'
#'0' transfer messages in 'repro-error'

Reproduction Steps

  1. Download this reproduction code
  2. Create an environment variable named AzureServiceBus_ConnectionString with a valid Azure Service Bus connection string
  3. Run the sample
  4. Check the error queue repro-error to confirm that the message is not present
  5. Check the input queue repro-queue to confirm that the message is not active but is stuck in transfer (see the verification sketch below).
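
The counts shown in the "Final state" output above can also be checked programmatically with the administration client; a hedged sketch (the queue names come from the repro, everything else is illustrative):

```csharp
using System;
using Azure.Messaging.ServiceBus.Administration;

var admin = new ServiceBusAdministrationClient(
    Environment.GetEnvironmentVariable("AzureServiceBus_ConnectionString"));

foreach (var queue in new[] { "repro-queue", "repro-destination", "repro-error" })
{
    QueueRuntimeProperties runtime = (await admin.GetQueueRuntimePropertiesAsync(queue)).Value;

    // Mirrors the "Final state" output: the lost message shows up only as a
    // transfer message on the input queue.
    Console.WriteLine($"#'{runtime.ActiveMessageCount}' messages in '{queue}'");
    Console.WriteLine($"#'{runtime.TransferMessageCount}' transfer messages in '{queue}'");
}
```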

Note that if we lower the message count to 105, we cannot reproduce this behavior. Even though we are still over the threshold, the message moves to the error queue as expected and is not stuck in transfer.

Environment

Output of dotnet --info

.NET SDK:
 Version:   7.0.304
 Commit:    7e794e2806

Runtime Environment:
 OS Name:     Windows
 OS Version:  10.0.22621
 OS Platform: Windows
 RID:         win10-x64
 Base Path:   C:\Program Files\dotnet\sdk\7.0.304\

Host:
  Version:      7.0.7
  Architecture: x64
  Commit:       5b20af47d9

.NET SDKs installed:
  2.1.512 [C:\Program Files\dotnet\sdk]
  2.2.105 [C:\Program Files\dotnet\sdk]
  6.0.101 [C:\Program Files\dotnet\sdk]
  6.0.118 [C:\Program Files\dotnet\sdk]
  7.0.304 [C:\Program Files\dotnet\sdk]

.NET runtimes installed:
  Microsoft.AspNetCore.All 2.1.15 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All]
  Microsoft.AspNetCore.All 2.1.16 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All]
  Microsoft.AspNetCore.All 2.2.3 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All]
  Microsoft.AspNetCore.App 2.1.15 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 2.1.16 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 2.2.3 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 3.1.30 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 3.1.32 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 6.0.1 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 6.0.10 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 6.0.18 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 7.0.7 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 2.1.15 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 2.1.16 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 2.2.3 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 3.1.30 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 3.1.32 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 5.0.14 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 6.0.1 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 6.0.10 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 6.0.12 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 6.0.15 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 6.0.18 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 7.0.7 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.WindowsDesktop.App 3.1.32 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
  Microsoft.WindowsDesktop.App 5.0.14 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
  Microsoft.WindowsDesktop.App 6.0.1 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
  Microsoft.WindowsDesktop.App 6.0.10 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
  Microsoft.WindowsDesktop.App 6.0.15 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
  Microsoft.WindowsDesktop.App 6.0.18 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
  Microsoft.WindowsDesktop.App 7.0.7 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]

Other architectures found:
  x86   [C:\Program Files (x86)\dotnet]
    registered at [HKLM\SOFTWARE\dotnet\Setup\InstalledVersions\x86\InstallLocation]

Project file

<TargetFramework>net6.0</TargetFramework>

Issue Analytics

  • State: open
  • Created 3 months ago
  • Comments: 16 (12 by maintainers)

Top GitHub Comments

1 reaction
danielmarbach commented, Jul 1, 2023

> but we would recommend sending multiple messages in a batch rather than queueing concurrent sends of 1 message each.

Yeah, the new code does that already. It reduces the likelihood of the problem occurring, but the issue can still be observed, as @lailabougria highlighted in https://github.com/Azure/azure-sdk-for-net/issues/37265#issuecomment-1611342020.
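
For reference, the batch-based sending the maintainers recommend looks roughly like this (a sketch only; the helper name and parameters are made up, and the real transport code is more involved):

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Azure.Messaging.ServiceBus;

// Hypothetical helper: sends messages in size-checked batches instead of
// issuing many concurrent single-message sends.
static async Task SendAsBatchesAsync(ServiceBusSender sender, IEnumerable<ServiceBusMessage> messages)
{
    ServiceBusMessageBatch batch = await sender.CreateMessageBatchAsync();

    foreach (ServiceBusMessage message in messages)
    {
        if (!batch.TryAddMessage(message))
        {
            // The current batch is full: flush it and start a new one for this message.
            await sender.SendMessagesAsync(batch);
            batch.Dispose();
            batch = await sender.CreateMessageBatchAsync();

            if (!batch.TryAddMessage(message))
            {
                throw new InvalidOperationException("A single message exceeds the maximum batch size.");
            }
        }
    }

    if (batch.Count > 0)
    {
        await sender.SendMessagesAsync(batch);
    }

    batch.Dispose();
}
```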

The best workaround we have in place right now is a client-side check of the transaction limit. We would have to adjust it if the service limit ever changed, but that's an acceptable tradeoff for now to prevent message loss.
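
A minimal sketch of such a client-side guard (the class, the method, and the assumption that the quota is still 100 operations per transaction are ours, not NServiceBus code):

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Transactions;
using Azure.Messaging.ServiceBus;

public static class TransactionLimitGuard
{
    // Current Azure Service Bus quota; would need updating if the service changed it.
    private const int MaxOperationsPerTransaction = 100;

    public static async Task SendGuardedAsync(
        ServiceBusSender sender, IReadOnlyCollection<ServiceBusMessage> messages)
    {
        // Fail fast on the client instead of letting the service fault the
        // transaction, which is what can leave a message stuck or lost.
        if (Transaction.Current is not null && messages.Count > MaxOperationsPerTransaction)
        {
            throw new InvalidOperationException(
                $"Attempted to send {messages.Count} messages in a single transaction; " +
                $"Azure Service Bus allows at most {MaxOperationsPerTransaction} operations per transaction.");
        }

        await sender.SendMessagesAsync(messages);
    }
}
```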

1 reaction
github-actions[bot] commented, Jun 30, 2023

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @shankarsama @DorothySun216 @EldertGrootenboer @saglodha.
