
OperationCancelledException on first call(s) to Cosmos from Azure App Service

See original GitHub issue

I’ve got a strange issue in an Azure App Service connecting to Cosmos DB.

When I start the App Service afresh (after deploying or a manual restart), I find that the first call(s) to Cosmos fail with the following error:

    System.OperationCanceledException: The operation was canceled.
       at System.Threading.CancellationToken.ThrowOperationCanceledException()
       at Microsoft.Azure.Cosmos.Query.Core.ExecutionContext.CosmosQueryExecutionContextFactory.TryCreateFromPartitionedQuerExecutionInfoAsync(DocumentContainer documentContainer, PartitionedQueryExecutionInfo partitionedQueryExecutionInfo, ContainerQueryProperties containerQueryProperties, CosmosQueryContext cosmosQueryContext, InputParameters inputParameters, ITrace trace, CancellationToken cancellationToken)
       at Microsoft.Azure.Cosmos.Query.Core.ExecutionContext.CosmosQueryExecutionContextFactory.TryCreateCoreContextAsync(DocumentContainer documentContainer, CosmosQueryContext cosmosQueryContext, InputParameters inputParameters, ITrace trace, CancellationToken cancellationToken)
       at Microsoft.Azure.Cosmos.Query.Core.AsyncLazy`1.GetValueAsync(ITrace trace, CancellationToken cancellationToken)
       at Microsoft.Azure.Cosmos.Query.Core.Pipeline.LazyQueryPipelineStage.MoveNextAsync(ITrace trace)
       at Microsoft.Azure.Cosmos.Query.Core.Pipeline.NameCacheStaleRetryQueryPipelineStage.MoveNextAsync(ITrace trace)
       at Microsoft.Azure.Cosmos.Query.Core.Pipeline.CatchAllQueryPipelineStage.MoveNextAsync(ITrace trace)
       at Microsoft.Azure.Cosmos.Query.QueryIterator.ReadNextAsync(ITrace trace, CancellationToken cancellationToken)

    CosmosDiagnostics:
    {"name":"Typed FeedIterator ReadNextAsync","id":"0159ae93-5f41-4379-b2c3-44493e72af14","component":"Unknown","caller info":{"member":"ReadNextWithRootTraceAsync","file":"FeedIteratorInternal{T}.cs","line":31},"start time":"04:25:10:050","duration in milliseconds":1070.2521,"data":{},"children":[{"name":"Create Query Pipeline","id":"de8fe725-edc8-4ed1-9cd0-da2c427a3efd","component":"Query","caller info":{"member":"TryCreateCoreContextAsync","file":"CosmosQueryExecutionContextFactory.cs","line":85},"start time":"04:25:10:099","duration in milliseconds":1012.1531,"data":{},"children":[{"name":"Get Container Properties","id":"e7fd64a5-b0e9-4c38-9ea2-a19c6d00dd99","component":"Transport","caller info":{"member":"GetCachedContainerPropertiesAsync","file":"ClientContextCore.cs","line":349},"start time":"04:25:10:100","duration in milliseconds":0.5946,"data":{},"children":[{"name":"Get Collection Cache","id":"91feaf20-c06b-4c20-8f2e-babdd8fb412a","component":"Routing","caller info":{"member":"GetCollectionCacheAsync","file":"DocumentClient.cs","line":542},"start time":"04:25:10:101","duration in milliseconds":0.0054,"data":{},"children":[]}]},{"name":"Service Interop Query Plan","id":"27a117d7-724e-434b-a257-c2d3cd604672","component":"Query","caller info":{"member":"GetQueryPlanWithServiceInteropAsync","file":"QueryPlanRetriever.cs","line":58},"start time":"04:25:10:109","duration in milliseconds":992.7327,"data":{},"children":[]}]}]}

If I send my request again, it all seems fine.

At first I thought it might be a startup thing, so I tried adding some warm-up code that calls ReadItemStreamAsync for items that don’t exist in each Container. This didn’t work. I have noticed that I can leave the server for as long as I like before the first call, and that first call (or first few) still fails with the above trace. So it’s as if some sort of lazy initialisation takes too long on the first call(s).
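For reference, a minimal sketch of that kind of warm-up, assuming the v3 .NET SDK; the database name, container names, and the dummy id/partition key value are placeholders rather than values from the original issue:

    using System.Collections.Generic;
    using System.Threading.Tasks;
    using Microsoft.Azure.Cosmos;

    public static class CosmosWarmUp
    {
        // Issues one cheap point read per container so the SDK resolves its
        // caches and opens connections before real traffic arrives.
        public static async Task WarmUpAsync(
            CosmosClient client, string databaseId, IEnumerable<string> containerIds)
        {
            foreach (string containerId in containerIds)
            {
                Container container = client.GetContainer(databaseId, containerId);

                // Reading an item that does not exist still forces the metadata
                // and routing lookups; the expected 404 response is ignored.
                using ResponseMessage _ = await container.ReadItemStreamAsync(
                    "warm-up-id", new PartitionKey("warm-up-pk"));
            }
        }
    }

As the maintainer explains in the comments below, a point read like this only opens connections to the one physical partition the dummy partition key hashes to, which is why it doesn’t fully eliminate the first-call overhead.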

  • Framework version: .NET 5.0
  • SDK version: 3.17.0
  • OS version: Windows (Azure App Service, latest)
  • App: x64 (win-x64)
  • Connection: Default (Direct)

  • Single Cosmos database with autoscale throughput (up to 4,000 RU/s)
  • 23 shared-throughput containers (default 400 RU/s base)

Any ideas what could be causing this? I’ve read the performance guide, but nothing there seems to apply directly to this problem. I’m not sending masses of calls, though; even a single one gets this error.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 12 (7 by maintainers)

Top GitHub Comments

2 reactions
ealsur commented, Mar 11, 2021

@andrew-tevent the reason a ReadItem call might not completely warm up all connections is that it is targeted at a particular partition, not all of them.

When the SDK initializes, it does a couple of things:

  • Fetch the account information (HTTP request) and cache it
  • Fetch the collection information (HTTP request) for the collection the request targets and cache it
  • Fetch the routing information (HTTP request) to obtain the IP addresses of the backend replicas and cache it
  • Open TCP connections to the partition that holds the document you are accessing, and keep them open

A ReadItem call will do all of these. A second ReadItem call will skip the first two, but it might read from a different partition (based on the hash of the partition key value), so new TCP connections to that partition might need to be opened. The overhead is not as big as for the first ReadItem, but it can still be there.

Once the TCP connections are established, any request landing on the same partition pays no overhead cost.
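To warm up every partition of every listed container up front, rather than only the one a point read happens to hit, something along these lines could work (a sketch only, assuming an SDK version that exposes CosmosClient.CreateAndInitializeAsync; the endpoint, key, and database/container names are placeholders):

    using System.Collections.Generic;
    using System.Threading.Tasks;
    using Microsoft.Azure.Cosmos;

    public static class CosmosStartup
    {
        // Builds a client and opens connections to every partition of the
        // listed containers before returning, so the first real request does
        // not pay the metadata and connection-establishment cost.
        public static Task<CosmosClient> CreateWarmClientAsync(string endpoint, string key)
        {
            IReadOnlyList<(string databaseId, string containerId)> containers =
                new List<(string, string)>
                {
                    ("my-database", "container-1"),   // placeholder names
                    ("my-database", "container-2"),
                    // ...one entry per container to warm up
                };

            return CosmosClient.CreateAndInitializeAsync(endpoint, key, containers);
        }
    }

This front-loads the metadata fetches and TCP connection setup described above into application startup instead of the first user request.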

1 reaction
bartelink commented, Mar 11, 2021

Yes, you’re right of course - it is a replacement for calling Create on the builder (I knew that but had forgotten!).

My general comments about doing this in a controlled fashion during app startup still stand, though.

There’s a series of very good blog posts in which Andrew Lock walks through ways to structure this within an app.

Perhaps the follow-up post might give you ideas on how to structure it too.
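For illustration, one way to run such warm-up at a controlled point during startup in an ASP.NET Core app is an IHostedService, which finishes starting before the web server begins accepting requests. This is my own sketch rather than code from the referenced posts, and it reuses the hypothetical CosmosWarmUp.WarmUpAsync helper sketched earlier; the database and container names are placeholders:

    using System.Threading;
    using System.Threading.Tasks;
    using Microsoft.Azure.Cosmos;
    using Microsoft.Extensions.Hosting;

    // Runs exactly once at startup, before requests are served, and performs
    // the Cosmos warm-up in one well-defined place.
    public class CosmosWarmUpService : IHostedService
    {
        private readonly CosmosClient _client;

        public CosmosWarmUpService(CosmosClient client) => _client = client;

        public Task StartAsync(CancellationToken cancellationToken) =>
            CosmosWarmUp.WarmUpAsync(_client, "my-database",
                new[] { "container-1", "container-2" });

        public Task StopAsync(CancellationToken cancellationToken) => Task.CompletedTask;
    }

    // Registration in ConfigureServices (endpoint and key are placeholders):
    // services.AddSingleton(new CosmosClient("<endpoint>", "<key>"));
    // services.AddHostedService<CosmosWarmUpService>();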

Read more comments on GitHub >

