Long-lived h2 gRPC connections stop forwarding requests
We’re seeing an issue with linkerd 1.1.0 where, after 12–18 hours of uptime, deadline expirations start to occur regularly when sending traffic over an h2 router. The longer the linkerd instance is left running, the more frequently deadlines occur.
There don’t appear to be any relevant linkerd log messages. The client observes this behavior as a timeout and sends an h2 reset frame after its deadline expires:
Jun 15 14:07:11 nomad-client2-p.dal10sl.bigcommerce.net linkerd[3398]: W 0615 19:07:11.311 UTC THREAD55 TraceId:e3c3649cc6eaf30c: Exception propagated to the default monitor (upstream address: /172.17.0.4:42448, downstream address: /10.143.147.85:4143, label: %/io.l5d.port/4143/#/io.l5d.consul/.local/storeconfig).
Jun 15 14:07:11 nomad-client2-p.dal10sl.bigcommerce.net linkerd[3398]: Reset.Cancel
These are the relevant failure metrics for the client in question (h2-out):
"rt/h2-out/client/%/io.l5d.port/4143/#/io.l5d.consul/.local/storeconfig/failures": 79858,
"rt/h2-out/client/%/io.l5d.port/4143/#/io.l5d.consul/.local/storeconfig/failures/com.twitter.finagle.buoyant.h2.Reset$Cancel$": 79813,
"rt/h2-out/client/%/io.l5d.port/4143/#/io.l5d.consul/.local/storeconfig/failures/com.twitter.finagle.buoyant.h2.Reset$Refused$": 6,
"rt/h2-out/client/%/io.l5d.port/4143/#/io.l5d.consul/.local/storeconfig/failures/com.twitter.finagle.buoyant.h2.Reset$InternalError$": 39,
I’ll attach a full metrics dump below.
Visually, this is what the client looks like in the linkerd admin console once it gets into this state (admin console screenshot attached in the original issue):
Issue Analytics
- Created: 6 years ago
- Comments: 20 (20 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
🎉 🎉 🎉
https://github.com/linkerd/linkerd/pull/1444 should fix this issue