Long-lived h2 gRPC connections stop forwarding requests
We’re seeing an issue with linkerd 1.1.0 where, after 12–18 hours of uptime, deadline expirations start to occur regularly when sending traffic over an h2 router. The longer the linkerd instance is left running, the more frequently deadlines occur.
There don’t appear to be any relevant linkerd log messages. The client observes this behavior as a timeout and sends an h2 reset frame after its deadline expires:
Jun 15 14:07:11 nomad-client2-p.dal10sl.bigcommerce.net linkerd[3398]: W 0615 19:07:11.311 UTC THREAD55 TraceId:e3c3649cc6eaf30c: Exception propagated to the default monitor (upstream address: /172.17.0.4:42448, downstream address: /10.143.147.85:4143, label: %/io.l5d.port/4143/#/io.l5d.consul/.local/storeconfig).
Jun 15 14:07:11 nomad-client2-p.dal10sl.bigcommerce.net linkerd[3398]: Reset.Cancel
These are the relevant failure metrics for the client in question (h2-out):
"rt/h2-out/client/%/io.l5d.port/4143/#/io.l5d.consul/.local/storeconfig/failures": 79858,
"rt/h2-out/client/%/io.l5d.port/4143/#/io.l5d.consul/.local/storeconfig/failures/com.twitter.finagle.buoyant.h2.Reset$Cancel$": 79813,
"rt/h2-out/client/%/io.l5d.port/4143/#/io.l5d.consul/.local/storeconfig/failures/com.twitter.finagle.buoyant.h2.Reset$Refused$": 6,
"rt/h2-out/client/%/io.l5d.port/4143/#/io.l5d.consul/.local/storeconfig/failures/com.twitter.finagle.buoyant.h2.Reset$InternalError$": 39,
I’ll attach a full metrics dump below.
Visually, this is what the client looks like in the linkerd admin console once it gets into this state (admin console screenshot attached in the original issue):
Issue Analytics
- Created: 6 years ago
- Comments: 20 (20 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
🎉 🎉 🎉
https://github.com/linkerd/linkerd/pull/1444 should fix this issue