H2 connection starts to deadline every request on a connection after random interval

See original GitHub issue

We are mirroring some gRPC production traffic through linkerd, and after a random interval (anywhere from 30 seconds to 2 hours) every request over the connection starts to hit its deadline. After this point all of the request charts in the admin UI go to zero. If the connection is recreated, by restarting either linkerd or the client service, traffic flow is temporarily restored.

There are no abnormal log messages, even with TRACE verbosity turned on, when linkerd gets into this state.

  • We ran a tcpdump to verify that the traffic was reaching linkerd (it was). The client service sends h2 request frames (HEADERS, DATA) over the stream and then, after the deadline interval, sends a RST_STREAM, which is the expected cancellation behavior. During this interval nothing is sent from linkerd other than TCP ACK packets.
  • We also ran the same test without linkerd in the path to see if it works (it does).
  • We tried disabling failure accrual on both the client linkerd and the server linkerd; it didn't make a difference.

A metrics snapshot was attached to the original issue.

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 15 (13 by maintainers)

Top GitHub Comments

klingerf commented on May 8, 2017 (2 reactions)

Ok, I was able to track down the issue and have put together a fix in #1280. I’ve also published a Docker image from that branch as buoyantio/linkerd:h2-fix. @zackangelo, @kenkouot, if you have a chance, can you verify that that image fixes the issue in your environments?

klingerf commented on May 4, 2017 (1 reaction)

Still don’t have a fix for this issue, but I wanted to provide another update.

In my test setup, I’m running an h2 router on port 6262, which forwards requests to a gRPC server running locally on port 8282. When looking at request patterns that trigger the error described in this issue, the bad behavior appears to be happening in the 8282 client (the client that linkerd uses to talk to the gRPC server).

The 8282 client shows two different patterns of state transitions; both are modeled in the sketch after the lists below. The most common one is:

  • outbound HEADER frame endStream=false => stream is Open/RemotePending
  • outbound DATA frame endStream=true => stream changes to LocalClosed/RemotePending
  • inbound HEADER frame endStream=false => stream changes to LocalClosed/RemoteStreaming
  • inbound DATA frame endStream=false => stream stays LocalClosed/RemoteStreaming
  • inbound HEADER frame endStream=true => stream changes to Closed/Closed

A less common pattern (roughly 15% of requests in a random sample) is:

  • outbound HEADER frame endStream=false => stream is Open/RemotePending
  • outbound DATA frame endStream=true => stream stays Open/RemotePending
  • inbound HEADER frame endStream=false => stream changes to Open/RemoteStreaming
  • inbound DATA frame endStream=false => stream stays Open/RemoteStreaming
  • inbound HEADER frame endStream=true => stream changes to Open/RemoteClosed
  • stream changes to Closed/Closed
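
To make the difference between the two orderings concrete, here is a small, hypothetical Scala sketch of the local/remote halves of the stream state. It is an editorial illustration, not linkerd’s actual stream bookkeeping: the names (StreamStateDemo, State, inbound, localClose) are invented, and the idea that the local-close transition is registered asynchronously (for example, only once the write of the endStream=true DATA frame completes) is an assumption introduced here to explain how the same frame sequence can produce either ordering.

```scala
// Hypothetical model of the stream-state transitions listed above.
// Not linkerd code; names and the asynchronous local-close are assumptions.
object StreamStateDemo extends App {
  sealed trait Local
  case object Open extends Local          // request may still be in flight locally
  case object LocalClosed extends Local   // our endStream=true write has been registered

  sealed trait Remote
  case object RemotePending extends Remote    // no response frames seen yet
  case object RemoteStreaming extends Remote  // response HEADERS seen, trailers not yet
  case object RemoteClosed extends Remote     // remote sent endStream=true

  case class State(local: Local, remote: Remote)

  // Inbound frames only ever advance the remote half.
  def inbound(endStream: Boolean)(s: State): State =
    if (endStream) s.copy(remote = RemoteClosed) else s.copy(remote = RemoteStreaming)

  // The local half closes whenever the endStream=true write is registered.
  def localClose(s: State): State = s.copy(local = LocalClosed)

  val start = State(Open, RemotePending)

  // Pattern 1: the outbound close is registered before any response frames arrive.
  val pattern1 = Seq(localClose _, inbound(false) _, inbound(false) _, inbound(true) _)
  // Pattern 2: the response frames arrive before the outbound close is registered.
  val pattern2 = Seq(inbound(false) _, inbound(false) _, inbound(true) _, localClose _)

  def trace(name: String, steps: Seq[State => State]): Unit =
    println(name + ": " + steps.scanLeft(start)((s, f) => f(s)).mkString(" -> "))

  trace("pattern 1", pattern1)
  trace("pattern 2", pattern2)
}
```

Running this prints the two transition sequences from the lists above; the frames are identical, and the only difference is where the local close lands relative to the inbound response frames.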

Both of these patterns result in successful requests, but only the second pattern ever triggers the error in this issue. When the issue is triggered, the final inbound HEADER is received by the 8282 client, but it is never sent as an outbound HEADER on the 6262 server. Messages are passed from the client to the server using util’s AsyncQueue. In the error situation, the queue refuses the final .offer of the last HEADER frame, but it is not clear to me why the offer is refused. This looks like a race condition in which the queue is reset before the offer is made, but I can’t determine where the reset is coming from. Will keep investigating.
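
For readers unfamiliar with util’s AsyncQueue, the following is a minimal, self-contained sketch of the refusal behavior described above. It uses only the public AsyncQueue API (offer, poll, fail); the frame strings and the reset exception are stand-ins introduced here for illustration, and this is not linkerd’s code.

```scala
import com.twitter.concurrent.AsyncQueue
import com.twitter.util.Await

// Sketch of how an offer can be refused: once the queue has been failed
// (reset), subsequent offers return false and the element is dropped.
object OfferAfterReset extends App {
  val q = new AsyncQueue[String]()

  // Normal case: offers are accepted and the reader can poll them.
  assert(q.offer("HEADERS endStream=false"))
  assert(q.offer("DATA endStream=true"))
  println(Await.result(q.poll())) // HEADERS endStream=false

  // If something fails (resets) the queue before the last frame is offered,
  // the offer is refused, so the trailing HEADERS frame never reaches the
  // reader, which is consistent with the 6262 server never emitting it.
  q.fail(new Exception("stream reset"))
  val accepted = q.offer("HEADERS endStream=true")
  println(s"final offer accepted: $accepted") // false
}
```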

Read more comments on GitHub.
