[EventHubs] Some unit tests are flaky or fail constantly

Summary

Four of the Event Processor Client tests are not working as expected: one of them is already tracked by #9228, while the other three are tracked by this issue.

PartitionClosingAsyncIsCalledWithOwnershipLostReasonWhenStoppingTheFailedProcessor is flaky. Sometimes the test hangs indefinitely until it is canceled; sometimes it passes. The flakiness is believed to have been introduced with the creation of the PartitionLoadBalancer class.
PartitionClosingAsyncTokenIsCanceledWhenStopProcessingAsyncIsCalled always fails. The reason is unknown.
ProcessErrorAsyncIsTriggeredWithCorrectArgumentsWhenOwnershipClaimFails always fails, but the cause is known.

https://github.com/Azure/azure-sdk-for-net/blob/1408ad9db2579f2943043549c6af70b98f9eb7fe/sdk/eventhub/Azure.Messaging.EventHubs.Processor/src/EventProcessorClient.cs#L949-L957

The Event Processor Client cannot determine which partition failed during the RunLoadBalancingAsync call, so it passes null to ProcessErrorEventArgs instead of the partition id. An ownership claim failure is the only load balancing scenario in which a partition id is expected; null should be kept for other types of failure.
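The expected behavior can be sketched in a language-neutral way. The Python below is only an illustration and not the Azure SDK's C# code: `OwnershipClaimError`, `describe_error`, and `run_load_balancing_cycle` are hypothetical names, and the actual fix in the client may look quite different. The idea is that a claim failure carries the partition id along with it, while every other load balancing failure reports None.

```python
from typing import Callable, Optional, Tuple


class OwnershipClaimError(Exception):
    """Hypothetical error raised when claiming a partition fails."""

    def __init__(self, partition_id: str, message: str) -> None:
        super().__init__(message)
        # Keep the partition id so the error handler can report it later.
        self.partition_id = partition_id


def describe_error(error: Exception) -> Tuple[Optional[str], str]:
    """Return the (partition_id, message) pair to hand to the error handler."""
    if isinstance(error, OwnershipClaimError):
        return error.partition_id, str(error)
    # Any other load balancing failure is not tied to a single partition.
    return None, str(error)


def run_load_balancing_cycle(
    cycle: Callable[[], None],
    on_error: Callable[[Optional[str], str], None],
) -> None:
    """Run one (hypothetical) load balancing cycle and report any failure."""
    try:
        cycle()
    except Exception as error:
        on_error(*describe_error(error))
```

With this shape, a test for the claim-failure case can assert on the partition id it receives, and a test for any other failure can assert that the id is None.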

Goal

Make the necessary changes to the client to make the tests pass reliably without flakiness. The tests themselves could be the real problem, so changes to the tests might also be necessary.
Remove the Ignore attribute from the aforementioned tests.
Assert that no other tests fail because of the changes.

Issue Analytics

Created 4 years ago
Comments: 11 (9 by maintainers)

Top GitHub Comments

jsquire commented, Mar 5, 2020

Looks like that disable was included in #10320

jsquire commented, Mar 6, 2020

In theory, no… the CI and nightly runs are the same for unit tests. The nightly runs include the live tests as well and, as a result, have a much higher potential for intermittent delays due to longer-running operations, availability in the thread pool, and general wonkiness around ARM calls.

Consequently, we seem to find tests with timing sensitivity during nightly runs more so than in CI or local runs, where the time variance isn’t as dramatic.
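One common way to make a test less sensitive to this kind of timing variance is to wait on an explicit completion signal with a generous upper bound, rather than sleeping for a fixed interval or waiting forever. The sketch below is a generic Python illustration and not code from the Event Hubs test suite; `wait_for_signal` is a hypothetical helper, and the timer merely stands in for whatever background work the test is observing.

```python
import threading


def wait_for_signal(signal: threading.Event, timeout_seconds: float) -> None:
    """Block until the signal is set, failing loudly instead of hanging."""
    if not signal.wait(timeout=timeout_seconds):
        raise TimeoutError(
            f"signal not observed within {timeout_seconds} seconds"
        )


# Stand-in for background work that finishes after a variable delay.
closed = threading.Event()
worker = threading.Timer(0.05, closed.set)
worker.start()

# The bound is deliberately far larger than the expected delay, so a slow
# machine does not fail the test, while a real hang still ends with an error.
wait_for_signal(closed, timeout_seconds=5.0)
worker.join()
```

The wait returns as soon as the signal arrives, so a generous bound costs nothing on a fast run and only matters when something is actually stuck.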