Frequent request timeouts (408)
We recently started using Azure Cosmos DB, and it has become obvious that we don't fully understand how to deal with some of the issues it brings.
In particular, we are observing a large number of request timeouts.
The exceptions look like this:
```json
[
  {
    "Details": null,
    "InnerExceptions": [
      {
        "Details": null,
        "InnerExceptions": [],
        "Message": "A client transport error occurred: The request timed out while waiting for a server response. (Time: 2020-06-09T12:43:32.8643249Z, activity ID: ebaa68c7-b8fc-46a8-8fb2-9d345a4b94d2, error code: ReceiveTimeout [0x0010], base error: HRESULT 0x80131500, URI: rntbd://cdb-ms-prod-westeurope1-fd12.documents.azure.com:14023/apps/b354ae5f-004d-4332-9e8b-699797d3441b/services/c6c0736e-5b33-4ec7-9917-25318f7713b8/partitions/1d54230d-f870-44cd-affb-83e77d5fc9ba/replicas/132357606089777958p/, connection: 10.0.4.110:56578 -> 13.69.112.4:14023, payload sent: True, CPU history: (2020-06-09T12:42:32.0825005Z 22.305), (2020-06-09T12:42:42.0804253Z 18.633), (2020-06-09T12:42:52.0768108Z 21.445), (2020-06-09T12:43:02.4892878Z 40.478), (2020-06-09T12:43:29.3331157Z 97.142), (2020-06-09T12:43:32.8487003Z 98.556), CPU count: 4)",
        "StackTrace": [
          "   at Microsoft.Azure.Documents.Rntbd.Channel.<RequestAsync>d__13.MoveNext()",
          "--- End of stack trace from previous location where exception was thrown ---",
          "   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()",
          "   at Microsoft.Azure.Documents.Rntbd.LoadBalancingPartition.<RequestAsync>d__9.MoveNext()",
          "--- End of stack trace from previous location where exception was thrown ---",
          "   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()",
          "   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)",
          "   at Microsoft.Azure.Documents.Rntbd.TransportClient.<InvokeStoreAsync>d__10.MoveNext()"
        ],
        "Type": "Microsoft.Azure.Documents.TransportException"
      }
    ],
    "Message": "Response status code does not indicate success: RequestTimeout (408); Substatus: 0; ActivityId: ebaa68c7-b8fc-46a8-8fb2-9d345a4b94d2; Reason: (Message: Request timed out.\r\nActivityId: ebaa68c7-b8fc-46a8-8fb2-9d345a4b94d2, Request URI: /apps/b354ae5f-004d-4332-9e8b-699797d3441b/services/c6c0736e-5b33-4ec7-9917-25318f7713b8/partitions/1d54230d-f870-44cd-affb-83e77d5fc9ba/replicas/132357606089777958p/, RequestStats: Please see CosmosDiagnostics, SDK: Windows/10.0.14393 cosmos-netstandard-sdk/3.9.0);",
    "StackTrace": [
      "   at Microsoft.Azure.Documents.Rntbd.TransportClient.<InvokeStoreAsync>d__10.MoveNext()",
      "--- End of stack trace from previous location where exception was thrown ---",
      "   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()",
      "   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)",
      "   at Microsoft.Azure.Documents.ConsistencyWriter.<WritePrivateAsync>d__18.MoveNext()",
      "--- End of stack trace from previous location where exception was thrown ---",
      "   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()",
      "   at Microsoft.Azure.Documents.StoreResult.VerifyCanContinueOnException(DocumentClientException ex)",
      "   at Microsoft.Azure.Documents.StoreResult.CreateStoreResult(StoreResponse storeResponse, Exception responseException, Boolean requiresValidLsn, Boolean useLocalLSNBasedHeaders, Uri storePhysicalAddress)",
      "   at Microsoft.Azure.Documents.ConsistencyWriter.<WritePrivateAsync>d__18.MoveNext()",
      "--- End of stack trace from previous location where exception was thrown ---",
      "   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()",
      "   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteRetryAsync>d__5.MoveNext()",
      "--- End of stack trace from previous location where exception was thrown ---",
      "   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()",
      "   at Microsoft.Azure.Documents.ShouldRetryResult.ThrowIfDoneTrying(ExceptionDispatchInfo capturedException)",
      "   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteRetryAsync>d__5.MoveNext()",
      "--- End of stack trace from previous location where exception was thrown ---",
      "   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()",
      "   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)",
      "   at Microsoft.Azure.Documents.ConsistencyWriter.<WriteAsync>d__17.MoveNext()",
      "--- End of stack trace from previous location where exception was thrown ---",
      "   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()",
      "   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)",
      "   at System.Runtime.CompilerServices.TaskAwaiter.ValidateEnd(Task task)",
      "   at Microsoft.Azure.Documents.ReplicatedResourceClient.<>c__DisplayClass27_0.<<InvokeAsync>b__0>d.MoveNext()",
      "--- End of stack trace from previous location where exception was thrown ---",
      "   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()",
      "   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)",
      "   at System.Runtime.CompilerServices.TaskAwaiter.ValidateEnd(Task task)",
      "   at Microsoft.Azure.Documents.RequestRetryUtility.<ProcessRequestAsync>d__2`2.MoveNext()",
      "--- End of stack trace from previous location where exception was thrown ---",
      "   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()",
      "   at Microsoft.Azure.Documents.ShouldRetryResult.ThrowIfDoneTrying(ExceptionDispatchInfo capturedException)",
      "   at Microsoft.Azure.Documents.RequestRetryUtility.<ProcessRequestAsync>d__2`2.MoveNext()",
      "--- End of stack trace from previous location where exception was thrown ---",
      "   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()",
      "   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)",
      "   at Microsoft.Azure.Documents.StoreClient.<ProcessMessageAsync>d__19.MoveNext()",
      "--- End of stack trace from previous location where exception was thrown ---",
      "   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()",
      "   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)",
      "   at Microsoft.Azure.Documents.ServerStoreModel.<ProcessMessageAsync>d__15.MoveNext()",
      "--- End of stack trace from previous location where exception was thrown ---",
      "   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()",
      "   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)",
      "   at Microsoft.Azure.Cosmos.Handlers.TransportHandler.<ProcessMessageAsync>d__3.MoveNext()",
      "--- End of stack trace from previous location where exception was thrown ---",
      "   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()",
      "   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)",
      "   at Microsoft.Azure.Cosmos.Handlers.TransportHandler.<SendAsync>d__2.MoveNext()"
    ],
    "Type": "Microsoft.Azure.Cosmos.CosmosException"
  }
]
```
We are migrating lots of data from SQL Server to Cosmos and the access pattern is as follows:
```csharp
while (!allMigrated)
{
    foreach (var batchNumber in Enumerable.Range(1, 50))
    {
        var items = FetchFromSQL(count: 100); // This takes about 2 seconds

        var writeTasks = items.Select(i => container.CreateItemAsync(
            i,
            requestOptions: new ItemRequestOptions { EnableContentResponseOnWrite = false }));

        // Synchronously block until the whole batch of 100 writes completes.
        Task.WhenAll(writeTasks).ConfigureAwait(false).GetAwaiter().GetResult();
    }

    Thread.Sleep(TimeSpan.FromMinutes(10)); // wait 10 minutes before the next run
}
```
There may be other threads writing to Cosmos at the same time, but these will typically write just a few items at a time.
We are using a single instance of CosmosClient throughout the application.
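For reference, we hold the client roughly like this (a minimal sketch; the endpoint, key, and option values are placeholders, not our real configuration):

```csharp
using Microsoft.Azure.Cosmos;

public static class CosmosClientFactory
{
    // CosmosClient is thread-safe and manages its own connection pool,
    // so one instance is created and shared for the application lifetime.
    private static readonly CosmosClient Client = new CosmosClient(
        accountEndpoint: "https://<account>.documents.azure.com:443/", // placeholder
        authKeyOrResourceToken: "<account-key>",                       // placeholder
        clientOptions: new CosmosClientOptions
        {
            ConnectionMode = ConnectionMode.Direct
        });

    public static Container GetContainer(string database, string container) =>
        Client.GetContainer(database, container);
}
```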
On the Azure Portal, I can see we are not being throttled.
So my question is basically: why are the requests timing out so often when we don't even hit the provisioned RU limit? (We currently have 11,000 RU/s in autoscale mode.)
Are we using it wrong? Is there a recommended pattern for inserting a batch or a large amount of data at once? `AllowBulkExecution` is not really useful here, as it waits up to 1 second for a batch to fill, and there will be situations where the batch just doesn't fill up quickly enough (the above migrator runs only every 10 minutes).
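For completeness, the bulk mode we experimented with looks roughly like this (a sketch only; `endpoint`, `key`, the database/container names, and the `Id` partition key property are placeholders):

```csharp
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

// Bulk execution is enabled on the client, not per request.
var client = new CosmosClient(endpoint, key, new CosmosClientOptions
{
    AllowBulkExecution = true
});
Container container = client.GetContainer("MyDatabase", "MyContainer"); // placeholders

// Fire all writes and await them together; with bulk enabled the SDK
// groups concurrent operations into batches per physical partition.
var tasks = items.Select(item => container.CreateItemAsync(
    item,
    new PartitionKey(item.Id), // placeholder partition key property
    new ItemRequestOptions { EnableContentResponseOnWrite = false }));

await Task.WhenAll(tasks);
```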
Can request timeouts also be caused by rate throttling? (That would not make much sense, though, as the Azure Portal shows we are not being rate-limited.)
I read through the request timeout troubleshooting guide and the only relevant points seem to be these:
- Users sometimes see elevated latency or request timeouts because their collections are provisioned insufficiently, the back-end throttles requests, and the client retries internally. Check the portal metrics.
- Azure Cosmos DB distributes the overall provisioned throughput evenly across physical partitions. Check portal metrics to see if the workload is encountering a hot partition key. This will cause the aggregate consumed throughput (RU/s) to appear to be under the provisioned RUs, but a single partition's consumed throughput (RU/s) will exceed the provisioned throughput.
And these two points go back to my question: how do I evaluate precisely what the reason for the timeouts is? I could keep raising the provisioned RUs until the timeouts stop, but that hardly seems like a reasonable approach.
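What I can do in the meantime is dump `CosmosDiagnostics` for failed and slow requests, something like the sketch below (`logger` and `MyItem` are placeholders; `GetClientElapsedTime()` exists in newer 3.x SDK releases):

```csharp
using System;
using System.Net;
using Microsoft.Azure.Cosmos;

try
{
    ItemResponse<MyItem> response = await container.CreateItemAsync(item);

    // Log slow-but-successful requests too, not only hard failures.
    if (response.Diagnostics.GetClientElapsedTime() > TimeSpan.FromSeconds(1))
    {
        logger.LogWarning("Slow Cosmos request: {0}", response.Diagnostics.ToString());
    }
}
catch (CosmosException ex) when (ex.StatusCode == HttpStatusCode.RequestTimeout)
{
    // The diagnostics string includes retry history, per-endpoint timings,
    // and client-side CPU history, which helps separate client-side
    // starvation from genuine server-side slowness.
    logger.LogError("408 from Cosmos: {0}", ex.Diagnostics.ToString());
    throw;
}
```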
Thank you for any insight.
Issue Analytics
- State:
- Created: 3 years ago
- Reactions: 6
- Comments: 24 (9 by maintainers)
Top GitHub Comments
Thanks for the feedback @j82w, I have been through that in detail already. I'm running on a Linux App Service with a good partition key (document id), I've tested in both Direct and Gateway modes, and I've played with idle timeouts and port re-use. I'm async in my whole architecture, use a singleton to hold my Cosmos client, and I'm using .NET Core 3.1 and the latest version of the Cosmos DB SDK (3.12.0). I've also used the troubleshooter for my TCP connections in my App Service and everything is stable at 50 to 60 connections, nothing failing.

The CPU on the App Service is running stable at about 5%, memory at about 40%, and the RUs of the Cosmos DB are peaking below 500 (it's autopilot up to 4,000). I've put logging on my Cosmos DB and it seems the requests don't even get to it, because there aren't any queries running there for more than a few milliseconds (or, if they are running, they're returning quickly and getting lost). Basically, the entire architecture is just ticking over, not breaking a sweat at all.
Yet, no matter what I try playing with, I still get 408 and socket timeouts on random requests. Normally at a rate of about 1 in 100. It also doesn’t matter whether the App Service has just started or been running a few hours. The error is always occurring on the MoveNext of a Cosmos method - whether it’s a Feed Iterator, a Stream Iterator or just trying CreateContainerIfNotExistsAsync. Here is an example of one - this hung for 1.1 minutes then crashed out with a CanceledException:
Response status code does not indicate success: RequestTimeout (408); Substatus: 0; ActivityId: fc033e9e-0cc8-45d8-8d7f-ffa258f9c7d4; Reason: (GatewayStoreClient Request Timeout. Start Time:09/02/2020 09:35:27; Total Duration:00:01:05.0137418; Http Client Timeout:00:01:05; Activity id: fc033e9e-0cc8-45d8-8d7f-ffa258f9c7d4; Inner Message: The operation was canceled.;, Request URI: /dbs/XXXX/colls/XXXX, RequestStats: , SDK: Linux/10 cosmos-netstandard-sdk/3.11.4); The operation was canceled.
Or another, this time it hung for 1.1 mins and then crashed with a SocketException:
Response status code does not indicate success: RequestTimeout (408); Substatus: 0; ActivityId: 5806f9f6-6b9f-4aa3-957e-f6cb507123e8; Reason: (GatewayStoreClient Request Timeout. Start Time:09/02/2020 10:28:33; Total Duration:00:01:05.0043387; Http Client Timeout:00:01:05; Activity id: 5806f9f6-6b9f-4aa3-957e-f6cb507123e8; Inner Message: The operation was canceled.;, Request URI: /dbs/XXXX/colls/XXXX/docs, RequestStats: , SDK: Linux/10 cosmos-netstandard-sdk/3.11.4); The operation was canceled. Unable to read data from the transport connection: Operation canceled. Operation canceled
It basically seems like it makes the request and then loses the response, so it just hangs. If you have any other ideas I'd love to hear them, because I'm kinda running out of options 😃 I did read somewhere to change the await to a Wait() on a Task, so I tried that with no luck. I'm desperate. I'll try anything 😉
@SumiranAgg do not use ReadContainerAsync as a health check. Reading the container is a metadata operation. Metadata operations in Cosmos DB are rate-limited and will eventually get throttled. It is also only called once on SDK initialization. I would recommend doing a data plane operation like ReadItemStream on a non-existing document. This will make sure you can actually connect and get a response from the container.
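A minimal sketch of such a probe (the document id is arbitrary and assumed not to exist; a 404 still proves connectivity, auth, and routing all work):

```csharp
using System.Net;
using Microsoft.Azure.Cosmos;

// Data plane probe: read a document that should not exist.
using ResponseMessage response = await container.ReadItemStreamAsync(
    id: "health-check-probe",
    partitionKey: new PartitionKey("health-check-probe"));

// OK or NotFound both mean the container is reachable; anything else
// (408, 503, ...) indicates a real connectivity or service problem.
bool healthy = response.StatusCode == HttpStatusCode.OK
            || response.StatusCode == HttpStatusCode.NotFound;
```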
Regarding the RequestTimeout, make sure you are using the latest SDK, 3.20.1. If it's still an issue after these changes, it would be best to open a support ticket…