Availability: Adds cross-region retry mechanism on 500 issues
Describe the bug
When a CosmosDB region is down (e.g., the North America (NOA) region issue that occurred at 1:00 AM UTC in February 2021, Azure Cosmos DB - North America - Mitigated (Tracking ID CVTV-R80)), it responds with a 500 error code.
It looks like the SDK continues to make requests to the region that constantly responds with 500 and doesn't fall back to another region.
There was a fix (see #1715) to address a similar problem, but it handles only 503, not 500.
To Reproduce
Make read or write requests to a region while that region is down and responding with 500 error codes.
Expected behavior
If PreferredLocations has more than one entry and the CosmosDB account has multiple read regions, the SDK client stops sending requests to the region that is responding with 500 and falls back to another region in the PreferredLocations list. Requests that already failed with 500 are retried against another PreferredLocation.
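For reference, a minimal sketch of how the preferred-location list mentioned above is wired up in the v2 SDK via `ConnectionPolicy.PreferredLocations`; the account endpoint, key, and region names are placeholder assumptions:

```csharp
using System;
using Microsoft.Azure.Documents.Client;

// V2 SDK: the order of PreferredLocations is the order the SDK is
// expected to fail over in. Endpoint, key, and regions are placeholders.
var policy = new ConnectionPolicy();
policy.PreferredLocations.Add("North Central US"); // primary read region
policy.PreferredLocations.Add("West US");          // expected fallback

var client = new DocumentClient(
    new Uri("https://your-account.documents.azure.com:443/"),
    "<account-key>",
    policy);
```

In the v3 SDK the same intent is expressed through `CosmosClientOptions.ApplicationPreferredRegions` (available in newer 3.x releases).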
Actual behavior
The SDK continues to send requests to the same region that responds with 500.
Environment summary
SDK Version: latest for v2 and v3
OS Version: observed on Windows
Additional context
N/A
Top GitHub Comments
@ealsur I understand your point that retries or automatic failover to another region on HTTP 500 can lead to the same 500 in another region. But there's also a chance that another region will succeed; we simply don't know.
For example, there is code in ClientRetryPolicy (here) that marks an endpoint as unavailable globally in the case of an HttpRequestException, without analyzing the status code, and does retry. So what I don't understand from your point and from the code is why an HttpRequestException without status code analysis is retriable and leads to endpoint failover and rediscovery, but a 500 is not treated as an HttpRequestException.
If you still believe that failover and retry of 500s shouldn't be handled by default, I think this should at least be configurable to give the client a chance to decide what to do in this situation (as is done, e.g., for 429s), because continuing to send requests to a region that responds with 500 for minutes or hours is not an optimal design decision.
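To illustrate the commenter's suggestion, here is a minimal application-level sketch (not SDK behavior) of falling back to a second region on 500. It assumes two `Container` instances obtained from two v3 `CosmosClient`s whose preferred-region lists start with different regions; the helper name and parameters are hypothetical:

```csharp
using System.Net;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class RegionFallback
{
    // Hypothetical helper: `primary` and `fallback` come from two clients
    // configured with different ApplicationPreferredRegions orderings.
    public static async Task<ItemResponse<T>> ReadWithFallbackAsync<T>(
        Container primary, Container fallback, string id, PartitionKey partitionKey)
    {
        try
        {
            return await primary.ReadItemAsync<T>(id, partitionKey);
        }
        catch (CosmosException ex) when (ex.StatusCode == HttpStatusCode.InternalServerError)
        {
            // The primary region keeps answering 500; retry once against a
            // client that prefers a different region, mirroring what the
            // commenter asks the SDK to do (or at least make configurable).
            return await fallback.ReadItemAsync<T>(id, partitionKey);
        }
    }
}
```

The 429 precedent mentioned above is exposed in the v3 SDK as `CosmosClientOptions.MaxRetryAttemptsOnRateLimitedRequests` and `MaxRetryWaitTimeOnRateLimitedRequests`; no comparable knob exists for 500s.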
The stack trace looks like the V2 SDK, not V3 (mainly because of Microsoft.Azure.Documents.Linq.DocumentQuery`1.<ExecuteNextPrivateAsync>d__36`1.MoveNext(); the V3 SDK's namespace is Microsoft.Azure.Cosmos). TaskCanceledException sounds like an HTTP timeout, in this case while trying to get Addresses for the target partitions. Newer V2 SDKs have fixes and improvements on resiliency for queries (particularly 2.11.5 and 2.12.0), so the recommendation would be to upgrade to those if possible.