Cancellation mechanism has negative impact on overall latency

What version of gRPC and what language are you using?

2.49.0 and 2.45.0

What operating system (Linux, Windows,…) and version?

macOS ARM64, but I believe it also happens on Linux x86, since I first spotted this issue on production servers (Docker).

What runtime / compiler are you using (e.g. .NET Core SDK version, dotnet --info)?

The SDK is .NET 6.0.401; the runtime is .NET 6.

What did you do?

I compared the Grpc.Core and Grpc.Net.Client clients when using a timeout (a cancellation token that cancels after a set duration).

What did you expect to see?

I expect a timeout to have no effect on the duration of a call when the call is expected to complete well within the timeout. I know that timers and clocks are somewhat imprecise, but that is a separate problem from the one described here.
For example, if 50% of calls take less than 2ms when I use no cancellation token (or a large timeout), I expect 50% of calls to still take less than 2ms if I change the timeout to, let’s say, 20ms.
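To make this expectation concrete, here is a minimal, language-neutral sketch in Python. The latencies are synthetic values chosen for illustration (not measurements from the repro), and `apply_ideal_timeout` is a hypothetical helper that models an ideal timeout: it cuts off calls that exceed the timeout and leaves every other call untouched.

```python
import statistics


def apply_ideal_timeout(latencies_ms, timeout_ms):
    """Model an ideal client-side timeout (hypothetical helper).

    A call that would run longer than the timeout ends at the timeout;
    every other call keeps its original latency.
    """
    return [min(latency, timeout_ms) for latency in latencies_ms]


# Synthetic per-call latencies in milliseconds (illustrative only).
latencies_ms = [1.0, 1.5, 1.8, 1.9, 2.5, 4.0, 30.0]

median_without_timeout = statistics.median(latencies_ms)
median_with_timeout = statistics.median(apply_ideal_timeout(latencies_ms, 20.0))

# Under this model the median is unchanged, because the timeout (20ms)
# is far above the median latency.
print(median_without_timeout, median_with_timeout)
```

Under this model only the slowest call is affected, so the median does not move. That is the behavior I expected from both clients.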

What did you see instead?

Given the example above, if I change the timeout to 20ms, every call appears to slow down, to the point that not a single call completes in under 20ms.

Repro

See https://github.com/ogxd/grpc-net-client-migration-perf-repro. There is a test project that you can use to simulate client/server calls locally with both Grpc.Core and Grpc.Net.Client. You’ll see that using different timeout values with Grpc.Core does not affect the overall response time, while it does with Grpc.Net.Client.

[Screenshots: Grpc.Core results] With Grpc.Core, the 50p (median) is the same whether we use a 200ms timeout or a 10ms timeout.

[Screenshots: Grpc.Net.Client results] With Grpc.Net.Client, the 50p (median) degrades to 75ms with a 30ms timeout, and not a single call succeeds when the timeout is 10ms.

Another odd thing is that it seems to work as expected when I run all the Grpc.Net.Client tests at once. Maybe it’s linked to something done at an initialization stage, or it is a side effect I don’t understand yet.

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 9 (4 by maintainers)

Top GitHub Comments

1 reaction
JamesNK commented, Oct 18, 2022

Canceling a call via cancellation token:

https://github.com/grpc/grpc-dotnet/blob/456098bcfeb9c796c19ea023b3ab3703181f1089/src/Grpc.Net.Client/Internal/GrpcCall.cs#L398-L404

And canceling a call via a deadline:

https://github.com/grpc/grpc-dotnet/blob/456098bcfeb9c796c19ea023b3ab3703181f1089/src/Grpc.Net.Client/Internal/GrpcCall.cs#L1047-L1059

Both happen in the same way: calling CancelCall, which tells the underlying HTTP handler to abort.

The difference is that there is extra code to ensure a call with a deadline is canceled only after the specified time has passed. .NET timers aren’t precise: a timer can fire up to about 14ms earlier or later than requested. I added extra code, used when a deadline is specified, to keep a call going if the timer fires but the deadline hasn’t passed yet:

https://github.com/grpc/grpc-dotnet/blob/456098bcfeb9c796c19ea023b3ab3703181f1089/src/Grpc.Net.Client/Internal/GrpcCall.cs#L1021-L1036

https://github.com/grpc/grpc-dotnet/blob/456098bcfeb9c796c19ea023b3ab3703181f1089/src/Shared/CommonGrpcProtocolHelpers.cs#L28-L54
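As a rough illustration of the idea described above (not the linked grpc-dotnet code), the check performed when the deadline timer fires might look like the following Python sketch. The function name `on_deadline_timer_fired` and its return values are hypothetical; the only assumption taken from the comment is that a call is kept alive if the timer fires before the deadline has actually passed.

```python
from datetime import datetime, timedelta, timezone


def on_deadline_timer_fired(deadline, now):
    """Decide what to do when an imprecise deadline timer fires.

    Hypothetical helper: returns ("cancel", 0) when the deadline has
    passed, otherwise ("reschedule", remaining) so the caller can arm
    the timer again for the time that is left.
    """
    remaining = deadline - now
    if remaining <= timedelta(0):
        return ("cancel", timedelta(0))
    return ("reschedule", remaining)


start = datetime(2022, 10, 18, tzinfo=timezone.utc)
deadline = start + timedelta(milliseconds=20)

# Timer fired 8ms early: the call keeps going and the timer is re-armed.
early = on_deadline_timer_fired(deadline, start + timedelta(milliseconds=12))

# Timer fired 5ms late: the deadline has passed, so the call is canceled.
late = on_deadline_timer_fired(deadline, start + timedelta(milliseconds=25))

print(early, late)
```

With a 20ms deadline and a timer that fires at 12ms, this sketch asks for the timer to be re-armed for the remaining 8ms instead of canceling the call. A plain cancellation token has no equivalent re-check, so an early timer cancels the call early.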

The reason a deadline works better is probably that it forces a longer effective timeout. As discussed earlier, very small timeouts are a bad idea.

0 reactions
ogxd commented, Oct 18, 2022

Thanks a lot for the pointers and clarifications.
