CosmosClientOptions.RequestTimeout - What is the actual scope?
Currently we have a configuration property called CosmosClientOptions.RequestTimeout, which is often a source of user questions or support cases.
What does RequestTimeout really do?
RequestTimeout sets a timeout at the network request level, for a single network interaction. This is where the confusion comes from, because our users don't issue network requests, they run operations. An operation (a query, an item create, etc.) can span multiple network requests if there is throttling or if we need to go through many pages (as in a query).
In these cases, customers often find it hard to understand why, with RequestTimeout set to 10 seconds, an operation might take 14 seconds. If we look at the Diagnostics of the operation, we will probably see that it involved maybe 7 requests of 2 seconds each, so technically everything is still within bounds (because the bounds apply at the network request level and each request took less than 10 seconds).
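The arithmetic above can be reproduced with a small simulation. This is only a sketch that uses Task.Delay in place of real network calls; SendRequestAsync and RunOperationAsync are hypothetical helpers, not SDK methods, and the per-request timeout is modelled with a CancellationTokenSource rather than the SDK's actual transport logic.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for one network request: it fails only if this
// single request runs longer than the per-request timeout.
static async Task SendRequestAsync(TimeSpan latency, TimeSpan requestTimeout)
{
    using var perRequest = new CancellationTokenSource(requestTimeout);
    await Task.Delay(latency, perRequest.Token);
}

// Hypothetical operation made of several sequential requests
// (for example retries after throttling, or pages of a query).
static async Task<TimeSpan> RunOperationAsync(int requestCount, TimeSpan latency, TimeSpan requestTimeout)
{
    var stopwatch = Stopwatch.StartNew();
    for (int i = 0; i < requestCount; i++)
    {
        // Each request gets a fresh timeout; nothing bounds the total.
        await SendRequestAsync(latency, requestTimeout);
    }
    return stopwatch.Elapsed;
}

// 7 requests of 2 seconds each, with a 10 second per-request timeout.
TimeSpan elapsed = await RunOperationAsync(
    requestCount: 7,
    latency: TimeSpan.FromSeconds(2),
    requestTimeout: TimeSpan.FromSeconds(10));

Console.WriteLine($"Operation finished in {elapsed.TotalSeconds:F1} s; no single request timed out.");
```

The whole operation takes roughly 14 seconds, yet no request comes close to its own 10 second limit, so no timeout is ever raised.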
Graphically, it could be seen like this:
So, what can the user do if they are interested in operation-level timeout?
Use CancellationTokens. CancellationTokens define an operation-level timeout:
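A minimal sketch of this approach with the .NET SDK follows. The endpoint, key, database, container, item id, and partition key are placeholders, and reading a single item is just one example operation; the same cancellationToken parameter is available on the other item and query calls.

```csharp
using System;
using System.Threading;
using Microsoft.Azure.Cosmos;

// Placeholder endpoint, key, and names: replace with real values.
using CosmosClient client = new CosmosClient(
    "https://<account>.documents.azure.com:443/",
    "<key>",
    new CosmosClientOptions
    {
        // Applies to each network request, not to the whole operation.
        RequestTimeout = TimeSpan.FromSeconds(10)
    });

Container container = client.GetContainer("<database>", "<container>");

// Operation-level budget: the token is cancelled 10 seconds after creation,
// regardless of how many network requests the operation needs.
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(10));

try
{
    ItemResponse<dynamic> response = await container.ReadItemAsync<dynamic>(
        id: "<item-id>",
        partitionKey: new PartitionKey("<partition-key>"),
        cancellationToken: cts.Token);

    Console.WriteLine($"Status: {response.StatusCode}");
}
catch (OperationCanceledException)
{
    // Cancellation surfaces as OperationCanceledException (or a subclass).
    Console.WriteLine("Operation cancelled: the 10 second budget was exceeded.");
}
```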
The only caveat is that CancellationTokens follow a design principle of cooperative cancellation. This means the library checks the token only at points where stopping is safe and will not leave an inconsistent state. We cannot cancel at the moment a byte is being placed on the network wire, for example, so the token does not give an exact stop time. Even with a CancellationToken defined, there is a chance of the operation succeeding after the expiration time (for example, HttpClient returns success responses even if the CancellationToken expired while the response was being received).
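Cooperative cancellation can be illustrated with a simulated operation that observes the token only between steps. This is a sketch under assumptions: RunStepsAsync is a hypothetical helper, and the uninterruptible Task.Delay stands in for work the library would not abort midway.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical multi-step operation that checks the token only between steps.
static async Task RunStepsAsync(int steps, TimeSpan stepDuration, CancellationToken token)
{
    for (int i = 0; i < steps; i++)
    {
        // Safe point: nothing is in flight, so stopping here leaves no partial state.
        token.ThrowIfCancellationRequested();

        // Simulates work that is not interrupted midway (no token is passed).
        await Task.Delay(stepDuration);
    }
}

// The token expires after 250 ms, in the middle of the second 200 ms step.
using var cts = new CancellationTokenSource(TimeSpan.FromMilliseconds(250));
var stopwatch = Stopwatch.StartNew();

try
{
    await RunStepsAsync(
        steps: 5,
        stepDuration: TimeSpan.FromMilliseconds(200),
        token: cts.Token);
}
catch (OperationCanceledException)
{
    // Cancellation is observed at the next safe point, after the step
    // already in progress has finished, not at the moment the token expires.
    Console.WriteLine($"Cancelled after {stopwatch.ElapsedMilliseconds} ms");
}
```

The reported time lands near the end of the second step rather than at the 250 ms expiry, which is the "not an exact stop time" behaviour described above.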
What could we do to improve the experience?
The question is really, do we need a RequestTimeout? Do users really need it? Or do they want an operation-level timeout, such as the CancellationToken semantics?
In that case, should we have an OperationDefaultTimeout that just sets a default CancellationToken for all operations if the user is not overriding it at the API call?
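One way to picture the proposed behaviour is a wrapper that supplies a default token only when the caller did not pass a cancellable one. This is a sketch of the idea, not an existing SDK feature: WithDefaultTimeoutAsync and SlowOperationAsync are hypothetical names, and the operation is simulated with Task.Delay.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: applies defaultTimeout unless the caller supplied
// a token of their own that can actually be cancelled.
static async Task<T> WithDefaultTimeoutAsync<T>(
    Func<CancellationToken, Task<T>> operation,
    TimeSpan defaultTimeout,
    CancellationToken callerToken = default)
{
    if (callerToken.CanBeCanceled)
    {
        // The caller overrides the default at the call site.
        return await operation(callerToken);
    }

    using var cts = new CancellationTokenSource(defaultTimeout);
    return await operation(cts.Token);
}

// Simulated operation that needs 300 ms in total.
static async Task<string> SlowOperationAsync(CancellationToken token)
{
    await Task.Delay(TimeSpan.FromMilliseconds(300), token);
    return "done";
}

TimeSpan defaultTimeout = TimeSpan.FromMilliseconds(100);

try
{
    // No caller token: the 100 ms default cancels the 300 ms operation.
    await WithDefaultTimeoutAsync<string>(SlowOperationAsync, defaultTimeout);
}
catch (OperationCanceledException)
{
    Console.WriteLine("Cancelled by the default operation timeout.");
}

// Caller token with a 1 second budget: the default is ignored and the call completes.
using var callerCts = new CancellationTokenSource(TimeSpan.FromSeconds(1));
string result = await WithDefaultTimeoutAsync<string>(SlowOperationAsync, defaultTimeout, callerCts.Token);
Console.WriteLine($"With caller token: {result}");
```

An alternative design would link the default and caller tokens with CancellationTokenSource.CreateLinkedTokenSource so that whichever fires first wins; the version above follows the "default unless overridden" wording of the question.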
Top GitHub Comments
In our case, we are using Cosmos DB to fetch the users' preferences on page load. To avoid a long page load time, we want a timeout for the entire query, say, 1 second. If the entire query times out, we want to cancel that query and do some failover. We don't care how many requests the query actually uses in this scenario.
And I feel that most people will interpret RequestTimeout as the timeout for the entire query. The many requests under the hood are an implementation detail that users are not aware of. How about having two timeouts, like QueryTimeout and PerRequestTimeout?

Had an offline conversation with @sebader. His experience with 3.19.0 is what is described in the linked PR: the first request will fail because the CancellationToken is blocking the retry after the regional DNS failure, but subsequent requests target the next region, which is expected. The time required to detect the DNS failure depends on the nature of the failure and on how long HttpClient takes to detect it. Depending on the token's timeout and HttpClient's detection time, there may or may not be enough time left to retry after detection. The goal, however, is to not block the failure detection from kicking in and marking the region unavailable. We would rather have 1 request fail (which could be retried with a retry policy on timeouts) than all of them.