Cancellation mechanism has negative impact on overall latency
What version of gRPC and what language are you using?
2.49.0 and 2.45.0
What operating system (Linux, Windows,…) and version?
MacOS ARM64, but I believe it also happens on Linux x86 since I have spotted this issue on production servers (docker).
What runtime / compiler are you using (e.g. .NET Core SDK version dotnet --info)
SDK is dotnet 6.0.401, runtime is dotnet 6.
What did you do?
Compared the Grpc.Core and Grpc.Net.Client clients when using a timeout (a cancellation token that cancels after a fixed duration).
What did you expect to see?
I expect a timeout to have no effect on the duration of a call when the call is expected to finish well before the timeout. I know the clock is somewhat imprecise, but that is a different problem from the one described here.
For example, if 50% of calls take less than 2ms when I use no cancellation token (or a large timeout), I expect 50% of calls to still take less than 2ms when I change the timeout to, say, 20ms.
What did you see instead?
Given the example above, if I change the timeout to 20ms, everything seems to slow down, to the point that not a single call completes in under 20ms.
Repro
See https://github.com/ogxd/grpc-net-client-migration-perf-repro. There is a test project that you can use to simulate client/server calls locally with both Grpc.Core and Grpc.Net.Client. You’ll see that using different timeout values with Grpc.Core does not affect the overall response time, while it does with Grpc.Net.Client.
With Grpc.Core, the 50p (median) is the same whether we use a 200ms or a 10ms timeout. With Grpc.Net.Client, the 50p (median) drops to 75ms with a 30ms timeout, and not a single call succeeds when the timeout is 10ms.
Another weird thing is that it seems to work as expected when I run all the Grpc.Net.Client tests at once. Maybe it's linked to something done during some initialization stage, or it's just a side effect I don't understand yet.
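For readers who want to reproduce this kind of measurement without the full repro project, here is a minimal sketch of a per-call timeout plus median computation. It is not taken from the linked repository: LatencyProbe, MeasureAsync and Median are hypothetical names, and the call is passed in as a delegate so that a gRPC call (or a simulated one) can be plugged in. It assumes a timed-out call surfaces as OperationCanceledException; by default a Grpc.Net.Client call reports cancellation as an RpcException instead, so the catch clause would need adjusting for a real client.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper for illustration only; not part of the repro project.
static class LatencyProbe
{
    // Runs `call` `count` times, each with its own timeout token, and returns
    // the elapsed milliseconds of the calls that finished before being canceled.
    public static async Task<List<double>> MeasureAsync(
        Func<CancellationToken, Task> call, TimeSpan timeout, int count)
    {
        var samples = new List<double>();
        for (int i = 0; i < count; i++)
        {
            // The token is canceled automatically once `timeout` has elapsed.
            using var cts = new CancellationTokenSource(timeout);
            var sw = Stopwatch.StartNew();
            try
            {
                await call(cts.Token);
                samples.Add(sw.Elapsed.TotalMilliseconds);
            }
            catch (OperationCanceledException)
            {
                // Timed out: not counted as a successful call.
            }
        }
        return samples;
    }

    // Median (50p) of the samples; NaN when no call succeeded.
    public static double Median(IReadOnlyList<double> samples)
    {
        if (samples.Count == 0) return double.NaN;
        var sorted = samples.OrderBy(x => x).ToArray();
        int mid = sorted.Length / 2;
        return sorted.Length % 2 == 1
            ? sorted[mid]
            : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }
}
```

Running MeasureAsync once with a large timeout and once with a small one, then comparing the two medians and the number of successful samples, gives the comparison described above.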
Top GitHub Comments
Canceling a call via cancellation token:
https://github.com/grpc/grpc-dotnet/blob/456098bcfeb9c796c19ea023b3ab3703181f1089/src/Grpc.Net.Client/Internal/GrpcCall.cs#L398-L404
And canceling a call via a deadline:
https://github.com/grpc/grpc-dotnet/blob/456098bcfeb9c796c19ea023b3ab3703181f1089/src/Grpc.Net.Client/Internal/GrpcCall.cs#L1047-L1059
Both happen in the same way: calling CancelCall, which tells the underlying HTTP handler to abort.
The difference is that there is extra code to ensure a deadline only cancels the call after the specified time has passed. .NET timers aren't precise: they can fire up to 14ms earlier or later than requested. When a deadline is specified, I added extra code to keep the call going if the timer fires before the deadline has passed:
https://github.com/grpc/grpc-dotnet/blob/456098bcfeb9c796c19ea023b3ab3703181f1089/src/Grpc.Net.Client/Internal/GrpcCall.cs#L1021-L1036
https://github.com/grpc/grpc-dotnet/blob/456098bcfeb9c796c19ea023b3ab3703181f1089/src/Shared/CommonGrpcProtocolHelpers.cs#L28-L54
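The idea of re-checking the deadline when the timer fires can be sketched roughly as follows. This is a minimal illustration, not the grpc-dotnet code linked above: DeadlineGuard and RunAsync are hypothetical names, and it uses Task.Delay as the timer, which may differ from what the library does.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper for illustration only; not the grpc-dotnet implementation.
static class DeadlineGuard
{
    // Invokes `onDeadline` only once `deadline` (UTC) has really passed,
    // even if the underlying timer wakes up early.
    // Assumes the deadline is near enough to fit within Task.Delay's maximum delay.
    public static async Task RunAsync(
        DateTime deadline, Action onDeadline, CancellationToken stop = default)
    {
        while (true)
        {
            TimeSpan remaining = deadline - DateTime.UtcNow;
            if (remaining <= TimeSpan.Zero)
            {
                // The deadline has passed according to the clock: cancel now.
                onDeadline();
                return;
            }

            // The timer may complete slightly early or late. Round up to a
            // whole millisecond so a sub-millisecond remainder does not spin,
            // then loop to re-check the clock instead of trusting the timer.
            var wait = TimeSpan.FromMilliseconds(Math.Ceiling(remaining.TotalMilliseconds));
            await Task.Delay(wait, stop);
        }
    }
}
```

In a client, onDeadline would be whatever cancels the call (CancelCall in the linked code). The loop means the call is never canceled before the deadline, though it can still be canceled late if the timer fires late.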
A deadline probably works better because it forces a longer effective timeout. As discussed earlier, very small timeouts are bad.
Thanks a lot for the pointers and clarifications.