Availability: Adds cross-region retry mechanism on 500 issues


Describe the bug: When a Cosmos DB region is down (for example, during the NOA region incident at 1:00 AM UTC in February 2021, "Azure Cosmos DB - North America - Mitigated", Tracking ID CVTV-R80), it responds with a 500 error code:

image

The SDK appears to keep sending requests to the region that constantly responds with 500, and it does not fall back to another region.

A previous fix (see #1715) addressed a similar problem, but it handles only 503, not 500.

To Reproduce: Make read or write requests to a region while that region is down and responding with 500 error codes.

Expected behavior: If PreferredLocations contains more than one region and the Cosmos DB account has multiple read regions, the SDK client stops sending requests to the region that is responding with 500 and falls back to another region in the PreferredLocations list. Requests that have already failed with 500 are retried against another preferred location.
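The expected behavior can be illustrated with a minimal, SDK-agnostic sketch. This is not the Cosmos DB SDK's implementation. The function `send_with_failover`, the `send` callback signature, and the region names are all hypothetical and exist only to show the idea of moving to the next preferred location when a region answers 500.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical transport callback: send(region) -> (status_code, body).
Send = Callable[[str], Tuple[int, str]]


def send_with_failover(
    preferred_locations: Sequence[str], send: Send
) -> Tuple[str, int, str]:
    """Try each preferred location in order, moving on when a region answers 500."""
    if not preferred_locations:
        raise ValueError("preferred_locations must not be empty")
    last = None
    for region in preferred_locations:
        status, body = send(region)
        last = (region, status, body)
        if status != 500:
            return last  # first region that did not answer 500
    return last  # every region answered 500: surface the last failure


def fake_send(region: str) -> Tuple[int, str]:
    # Simulated outage (assumption for the demo): "West US" always answers 500.
    if region == "West US":
        return (500, "Internal Server Error")
    return (200, "ok")


result = send_with_failover(["West US", "East US"], fake_send)
print(result)  # ('East US', 200, 'ok')
```

A real client would also need to decide whether a failed write is safe to retry in another region, which this sketch does not address.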

Actual behavior: The SDK continues to send requests to the same region that responds with 500.

Environment summary: SDK Version: latest for v2 and v3. OS Version: observed on Windows.

Additional context: N/A

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

1 reaction
yar-shukan commented, Feb 16, 2021

@ealsur I understand your point that retries or automatic failover to another region on HTTP 500 can lead to the same 500 on another region. But there’s also a chance that another region will be successful, we simply don’t know.

For example, the code in ClientRetryPolicy marks the endpoint as unavailable globally when an HttpRequestException occurs, without analyzing the status code, and then retries.

What I don’t understand, from your point and from the code, is why an HttpRequestException is retriable without any status code analysis and leads to endpoint failover and rediscovery, while a 500 is not treated like an HttpRequestException.

If you still believe that failover and retry on 500s shouldn’t be handled by default, I think this should at least be configurable, so that a client can decide what to do in this situation (as is done, for example, for 429s). Continuing to send requests to a region that responds with 500 for minutes or hours is not an optimal design decision.
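The configurability suggested above could look roughly like the following sketch. It is an illustration only, not the SDK's ClientRetryPolicy: the class `RegionFailoverPolicy`, its parameters, and the cooldown value are hypothetical, and the 503-only default is an assumption based on the behavior described in this issue.

```python
import time
from typing import Callable, Dict, Iterable, Sequence


class RegionFailoverPolicy:
    """Hypothetical policy: the caller chooses which status codes trigger failover."""

    def __init__(
        self,
        preferred_locations: Sequence[str],
        failover_statuses: Iterable[int] = (503,),  # assumed default: 503 only
        cooldown_seconds: float = 300.0,  # assumed value, not taken from the SDK
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        if not preferred_locations:
            raise ValueError("preferred_locations must not be empty")
        self._locations = list(preferred_locations)
        self._failover_statuses = frozenset(failover_statuses)
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._unavailable_until: Dict[str, float] = {}

    def current_region(self) -> str:
        """Return the first preferred region that is not marked unavailable."""
        now = self._clock()
        for region in self._locations:
            until = self._unavailable_until.get(region)
            if until is None or until <= now:
                return region
        # Every region is marked unavailable: fall back to the first preference.
        return self._locations[0]

    def record_response(self, region: str, status: int) -> bool:
        """Mark the region unavailable on a configured status; True means retry elsewhere."""
        if status in self._failover_statuses:
            self._unavailable_until[region] = self._clock() + self._cooldown
            return True
        return False


# Opting in to failover on 500 as well as 503.
policy = RegionFailoverPolicy(["West US", "East US"], failover_statuses={500, 503})
first = policy.current_region()
should_retry = policy.record_response(first, 500)
second = policy.current_region()
print(first, should_retry, second)  # West US True East US
```

Keeping the status set as a parameter leaves the default conservative while letting callers who prefer cross-region retries on 500 opt in.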

0 reactions
ealsur commented, Feb 17, 2021

The stack trace looks like the V2 SDK, not V3, mainly because of Microsoft.Azure.Documents.Linq.DocumentQuery`1.<ExecuteNextPrivateAsync>d__36`1.MoveNext(); the V3 SDK’s namespace is Microsoft.Azure.Cosmos.

TaskCanceledException sounds like an HTTP timeout, in this case while trying to get Addresses for the target partitions.

Newer V2 SDKs have fixes and improvements on resiliency for queries (particularly 2.11.5 and 2.12.0), so the recommendation would be to upgrade to those if possible.


