Response status code does not indicate success: RequestTimeout (408)
Similar issue to this one, but in our case the CPU ramps up to 97% and I can't understand why.
Our Cosmos DB is set to auto scale and we haven’t crossed 50% of the max RU consumption in the last 7 days.
The update is requested from an Azure Function v4 (Linux, net6.0, isolated process) on a Premium Plan.
I followed this document: https://docs.microsoft.com/en-us/azure/cosmos-db/sql/troubleshoot-dot-net-sdk-request-timeout?tabs=cpu-new#high-cpu-utilization
and cross-checked all the points:
- All SNAT connections were successful (in the last 24 hours)
- We use our CosmosContext, which inherits from DbContext:

  services.AddDbContext<CosmosContext>(options =>
  {
      options.UseCosmos(configuration[AppSettingsKeys.CosmosDbConnection], "somenamehere");
  });

  which internally creates a singleton CosmosClient (a sketch of the provider options that can be tuned through this registration follows after the list)
- We are nowhere near the service limits
- There is no HTTP proxy
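For context on that registration: UseCosmos also accepts an options callback (EF Core 6's CosmosDbContextOptionsBuilder) that maps onto the CosmosClientOptions of the singleton client. A minimal sketch; the option values below are illustrative, not our actual configuration:

```csharp
// Sketch only: same registration, but using the UseCosmos overload that exposes
// CosmosDbContextOptionsBuilder. Values are illustrative.
services.AddDbContext<CosmosContext>(options =>
{
    options.UseCosmos(
        configuration[AppSettingsKeys.CosmosDbConnection],
        "somenamehere",
        cosmosOptions =>
        {
            // These map to CosmosClientOptions on the provider's singleton CosmosClient.
            cosmosOptions.ConnectionMode(ConnectionMode.Direct);      // requires Microsoft.Azure.Cosmos
            cosmosOptions.RequestTimeout(TimeSpan.FromSeconds(10));
            cosmosOptions.IdleTcpConnectionTimeout(TimeSpan.FromMinutes(20));
        });
});
```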
This happens in a particular function that pulls the document and updates nested properties. The document is around 40 KB. The function has a Service Bus trigger and the following retry policy:
"retry": {
"strategy": "exponentialBackoff",
"maxRetryCount": 3,
"minimumInterval": "00:00:03",
"maximumInterval": "00:00:10"
},
I have no idea what’s going on.
Here are the diagnostics recorded in the exception details:
"Diagnostics":{
"name":"ReplaceItemStreamAsync",
"id":"d4330cac-9cd4-4fb9-ac70-26a0942b96a6",
"caller info":{
"member":"OperationHelperWithRootTraceAsync",
"file":"ClientContextCore.cs",
"line":244
},
"start time":"10:45:08:241",
"duration in milliseconds":12210.9945,
"data":{
"Client Configuration":{
"Client Created Time Utc":"2022-06-10T11:56:21.5647195Z",
"NumberOfClientsCreated":2,
"User Agent":"cosmos-netstandard-sdk/3.21.0|3.21.1|2|X64|Linux 5.4.0-1074-azure 77 18.|.NET 6.0.5|N| Microsoft.EntityFrameworkCore.Cosmos/6.0.5",
"ConnectionConfig":{
"gw":"(cps:50, urto:10, p:False, httpf: False)",
"rntbd":"(cto: 5, icto: -1, mrpc: 30, mcpe: 65535, erd: True, pr: ReuseUnicastPort)",
"other":"(ed:False, be:False)"
},
"ConsistencyConfig":"(consistency: NotSet, prgns:[])"
}
},
"children":[
{
"name":"Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler",
"id":"3892e9c8-a327-4ae2-a1b4-4b30b552721c",
"start time":"10:45:08:241",
"duration in milliseconds":12210.9644,
"children":[
{
"name":"Microsoft.Azure.Cosmos.Handlers.DiagnosticsHandler",
"id":"042f8751-3514-46ac-bd3b-e51ff061ac70",
"start time":"10:45:08:241",
"duration in milliseconds":12210.932,
"data":{
"System Info":{
"systemHistory":[
{
"dateUtc":"2022-06-14T10:44:12.4755898Z",
"cpu":9.907,
"memory":3178468.000,
"threadInfo":{
"isThreadStarving":"False",
"threadWaitIntervalInMs":0.0213,
"availableThreads":32766,
"minThreads":2,
"maxThreads":32767
}
},
{
"dateUtc":"2022-06-14T10:44:22.4788493Z",
"cpu":4.343,
"memory":3178484.000,
"threadInfo":{
"isThreadStarving":"False",
"threadWaitIntervalInMs":0.0088,
"availableThreads":32766,
"minThreads":2,
"maxThreads":32767
}
},
{
"dateUtc":"2022-06-14T10:44:39.0703495Z",
"cpu":79.250,
"memory":3484276.000,
"threadInfo":{
"isThreadStarving":"False",
"threadWaitIntervalInMs":0.209,
"availableThreads":32756,
"minThreads":2,
"maxThreads":32767
}
},
{
"dateUtc":"2022-06-14T10:44:51.4720374Z",
"cpu":79.208,
"memory":2110288.000,
"threadInfo":{
"isThreadStarving":"False",
"threadWaitIntervalInMs":6.154,
"availableThreads":32737,
"minThreads":2,
"maxThreads":32767
}
},
{
"dateUtc":"2022-06-14T10:45:01.5421178Z",
"cpu":82.129,
"memory":959112.000,
"threadInfo":{
"isThreadStarving":"False",
"threadWaitIntervalInMs":0.3395,
"availableThreads":32732,
"minThreads":2,
"maxThreads":32767
}
},
{
"dateUtc":"2022-06-14T10:45:20.1404512Z",
"cpu":97.987,
"memory":1891392.000,
"threadInfo":{
"isThreadStarving":"False",
"threadWaitIntervalInMs":1.2721,
"availableThreads":32730,
"minThreads":2,
"maxThreads":32767
}
}
]
}
},
"children":[
{
"name":"Microsoft.Azure.Cosmos.Handlers.RetryHandler",
"id":"b87c1d09-2c23-470f-988e-70558cfcdcb5",
"start time":"10:45:08:241",
"duration in milliseconds":12210.9261,
"children":[
{
"name":"Microsoft.Azure.Cosmos.Handlers.RouterHandler",
"id":"b513f900-a379-4bfe-b5f3-9d52d15398ff",
"start time":"10:45:08:241",
"duration in milliseconds":12210.7416,
"children":[
{
"name":"Microsoft.Azure.Cosmos.Handlers.TransportHandler",
"id":"27ab336b-34d4-405d-9534-ab79980d0b29",
"start time":"10:45:08:241",
"duration in milliseconds":12210.6676,
"children":[
{
"name":"Microsoft.Azure.Documents.ServerStoreModel Transport Request",
"id":"ee060395-4562-4b8c-a6b8-c24daf7d3e45",
"caller info":{
"member":"ProcessMessageAsync",
"file":"TransportHandler.cs",
"line":109
},
"start time":"10:45:08:241",
"duration in milliseconds":12169.0857,
"data":{
"Client Side Request Stats":{
"Id":"AggregatedClientSideRequestStatistics",
"ContactedReplicas":[
{
"Count":1,
"Uri":""
},
{
"Count":1,
"Uri":""
},
{
"Count":1,
"Uri":""
}
],
"RegionsContacted":[
],
"FailedReplicas":[
],
"AddressResolutionStatistics":[
],
"StoreResponseStatistics":[
]
}
}
}
]
}
]
}
]
}
]
}
]
}
]
}
Additionally, I get "ghost updates":

product.UpdateStock(5);
await _cosmosContext.SaveChangesAsync(CancellationToken);
_logger.Information("Stock Update {@Request}", new
{
    product.StockQuantity,
});

The log tells me the document has been updated (product.StockQuantity = 5), but querying the actual document reveals it is still set to the value from the previous update (product.StockQuantity = 0).
No exception is thrown related to this particular update.
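To rule out the change tracker handing back a cached entity (rather than the write not persisting), one option is to re-read the document with tracking disabled right after SaveChangesAsync. A minimal sketch; the Products DbSet and Id property are assumed names, not our real model:

```csharp
product.UpdateStock(5);
await _cosmosContext.SaveChangesAsync(CancellationToken);

// Hypothetical verification read: AsNoTracking forces a round trip to Cosmos instead of
// returning the tracked instance, so the logged value reflects what is actually stored.
var stored = await _cosmosContext.Products
    .AsNoTracking()
    .FirstAsync(p => p.Id == product.Id, CancellationToken);

_logger.Information("Stock after save {@Request}", new
{
    Tracked = product.StockQuantity,
    Stored = stored.StockQuantity,
});
```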
Top GitHub Comments
There are still 2 clients being created and active, is this what you expect?
This is a timeout; there are 2 potential issues:
- You have high Transit Time, meaning something is not entirely right in the network (2 seconds for a request is massive).
- Very high time on Received: this means the response is sitting there waiting ~8 seconds to be consumed, which points to thread-pool issues. The I/O response is an async operation, and this is the time before the async Task is processed, meaning the thread pool cannot assign a thread to continue that async Task for 8 seconds. This usually points at code in the app blocking threads (https://docs.microsoft.com/en-us/azure/cosmos-db/sql/troubleshoot-dot-net-sdk-slow-request?tabs=cpu-new#rntbdRequestStats): some code might not be following async/await and might be using .Result/GetAwaiter().GetResult()/etc., which blocks threads and prevents the thread pool from using them to resume async operations. This can also lead to high CPU usage. Useful guide: https://github.com/davidfowl/AspNetCoreDiagnosticScenarios/blob/master/AsyncGuidance.md#avoid-using-taskresult-and-taskwait
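As an illustration of that anti-pattern versus the fully asynchronous form (the repository and method names below are made up):

```csharp
// Anti-pattern: blocking on an async call holds a thread-pool thread for the whole
// I/O, so fewer threads are available to resume completed Cosmos responses.
public Product GetProductBlocking(string id)
{
    return _repository.GetProductAsync(id).GetAwaiter().GetResult(); // blocks a thread
}

// Preferred: await end-to-end; the thread is returned to the pool while the request
// is in flight. _repository and GetProductAsync are illustrative names.
public async Task<Product> GetProduct(string id)
{
    return await _repository.GetProductAsync(id);
}
```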
CPU values in Linux are obtained from /proc/stat/cpu; it's the system-wide CPU. I don't know what the metrics in the Portal are reading.

Transient timeouts can happen and the app should have some way to handle them: https://docs.microsoft.com/en-us/azure/cosmos-db/sql/conceptual-resilient-sdk-applications#timeouts-and-connectivity-related-failures-http-408503
It's when the volume affects P99 that you should investigate: https://docs.microsoft.com/en-us/azure/cosmos-db/sql/conceptual-resilient-sdk-applications#when-to-contact-customer-support
Reference: https://docs.microsoft.com/en-us/azure/cosmos-db/sql/troubleshoot-dot-net-sdk-request-timeout?tabs=cpu-new#troubleshooting-steps
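As a sketch of what handling those transient 408/503 responses could look like against the SDK directly (the container, item, retry count and backoff are illustrative; with the EF Core provider the CosmosException may surface wrapped in a higher-level exception):

```csharp
using System;
using System.Net;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class CosmosRetry
{
    // Sketch: retry a replace on transient 408/503, on top of the SDK's built-in retries.
    public static async Task ReplaceWithRetryAsync<T>(Container container, T item, string id, PartitionKey pk)
    {
        const int maxAttempts = 3;
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                await container.ReplaceItemAsync(item, id, pk);
                return;
            }
            catch (CosmosException ex) when (
                (ex.StatusCode == HttpStatusCode.RequestTimeout ||
                 ex.StatusCode == HttpStatusCode.ServiceUnavailable) &&
                attempt < maxAttempts)
            {
                // ex.Diagnostics.ToString() holds the details worth logging for these timeouts.
                await Task.Delay(TimeSpan.FromSeconds(attempt));
            }
        }
    }
}
```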
Please update the SDK to a newer version and share the updated diagnostics. The version you are using does not include diagnostics for timeouts (added in 3.24: https://github.com/Azure/azure-cosmos-dotnet-v3/blob/master/changelog.md#-3240---2022-01-31).
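Once on a newer SDK, the diagnostics are also available on successful responses, so slow calls can be logged proactively. A minimal sketch using the raw CosmosClient rather than the EF Core provider; the names and the 1-second threshold are illustrative:

```csharp
// Sketch: log CosmosDiagnostics when a point operation is slower than expected.
ItemResponse<Product> response = await container.ReadItemAsync<Product>(id, partitionKey);

if (response.Diagnostics.GetClientElapsedTime() > TimeSpan.FromSeconds(1))
{
    _logger.Warning("Slow Cosmos request {Diagnostics}", response.Diagnostics.ToString());
}
```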
The only thing we can see is that there seem to be 2 clients:
"NumberOfClientsCreated":2,
We cannot tell you why your CPU is high; CPU analysis needs to be performed on the running machine.