Failure detector closes connection if linkerd receives data faster than it can write
Issue Type:
- Bug report
What happened:
We have a setup where we ship data between applications using stream-to-unary gRPC calls. Traffic goes through linkerd, and our application also uses linkerd’s gRPC implementation. The total amount of data can reach tens of GBs; individual messages are less than 64KB. We’ve seen streams getting sporadically cancelled with Reset.Cancel. Retrying the stream once or twice is usually enough to get around the issue.
With debug logging turned on, we can see that the cancellations are caused by the failure detector:
May 29 00:25:43 <hostname> bash[34214]: D 0529 00:25:43.199 EEST THREAD15: failure detector closed HTTP/2 connection: C L:/127.0.0.1:41456 R:127.0.0.1/127.0.0.1:64446
However, there seem to be no issues with the network, and the RTT between servers is sub-millisecond. In fact, as seen in the snippet above, the failure detector closes even localhost connections. There are no GC pauses or similar that would explain this.
While digging into this issue, I often saw that the failure detector ping was added to the event loop queue but executed only a few seconds later or, by the time the failure detector closed the connection, not at all.
The reason for the delay seems to be BufferingChannelTransport. When a frame is written to BufferingChannelTransport, it is added to a queue and a flushing task is scheduled on the event loop. The task keeps flushing until the queue is empty. In situations where data is written to the queue faster than it can be flushed, the flushing task can take arbitrarily long to complete, which prevents any other task queued on the event loop from running. Since the ping countdown starts as soon as the ping task is queued, the failure detector can end up closing a healthy connection that just happens to be too busy.
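To make the mechanism concrete, here is a minimal sketch (not linkerd source): a single-threaded executor stands in for the Netty event loop, flushFrame for the actual socket write, and the drain task keeps going until the queue is empty, so anything queued behind it has to wait.

import java.util.concurrent.{ConcurrentLinkedQueue, Executors}

object FlushStarvationSketch {
  // Single-threaded executor models the event loop shared by writes and pings.
  private val eventLoop = Executors.newSingleThreadExecutor()
  private val pending   = new ConcurrentLinkedQueue[Array[Byte]]()

  // Producers enqueue frames and make sure a flush task is scheduled.
  def write(frame: Array[Byte]): Unit = {
    pending.add(frame)
    eventLoop.execute(() => drainAll())
  }

  // Problematic behaviour: flush until the queue is empty. If producers add
  // frames faster than flushFrame() completes, this task never finishes and
  // every task queued behind it (including the ping) waits.
  private def drainAll(): Unit = {
    var frame = pending.poll()
    while (frame != null) {
      flushFrame(frame)
      frame = pending.poll()
    }
  }

  // The failure-detector ping is just another task on the same event loop,
  // and its deadline starts counting down as soon as it is queued.
  def sendPing(): Unit = eventLoop.execute(() => flushFrame(new Array[Byte](8))) // dummy ping frame

  private def flushFrame(frame: Array[Byte]): Unit = {
    // Stand-in for the real socket write.
  }
}

As long as write() keeps the queue non-empty, drainAll() never returns, and a ping queued behind it never runs even though the connection itself is perfectly healthy.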
A fix/workaround is to limit the number of frames BufferingChannelTransport flushes in one go before scheduling a new task. It’s still possible for the ping task to stay queued for too long (especially if RTT is naturally high), but the likelihood of that should be reduced.
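A sketch of that workaround, under the same assumptions as the snippet above (maxFramesPerFlush is an illustrative constant, not an actual linkerd setting):

import java.util.concurrent.{ConcurrentLinkedQueue, Executors}

object CappedFlushSketch {
  private val eventLoop         = Executors.newSingleThreadExecutor()
  private val pending           = new ConcurrentLinkedQueue[Array[Byte]]()
  private val maxFramesPerFlush = 128 // illustrative value

  def write(frame: Array[Byte]): Unit = {
    pending.add(frame)
    eventLoop.execute(() => drainCapped())
  }

  // Flush at most maxFramesPerFlush frames per task, then re-queue the drain
  // so other tasks on the event loop (e.g. the ping) get a chance to run.
  private def drainCapped(): Unit = {
    var flushed = 0
    var frame   = pending.poll()
    while (frame != null) {
      flushFrame(frame)
      flushed += 1
      if (flushed >= maxFramesPerFlush) {
        if (!pending.isEmpty) eventLoop.execute(() => drainCapped())
        return
      }
      frame = pending.poll()
    }
  }

  private def flushFrame(frame: Array[Byte]): Unit = {
    // Stand-in for the real socket write.
  }
}

Each drain task now finishes after at most maxFramesPerFlush frames, so a queued ping runs between drain tasks instead of waiting for the queue to empty.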
What you expected to happen:
The failure detector should not close healthy connections just because there is too much traffic.
How to reproduce it (as minimally and precisely as possible):
Stream data from client to server via 2 linkerds. Everything can be running on a single machine.
Example service definition:
service StreamService {
  rpc Transmit (stream DataChunk) returns (StatusResponse);
}

message DataChunk {
  bytes data = 1;
}

message StatusResponse {
  enum Status {
    OK = 0;
    ERROR = 1;
  }
  Status status = 1;
}
Data can be whatever. We’ve run with chunk sizes of 32KB and 63KB.
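For reference, here is a minimal sketch of a client that generates this traffic pattern. It is only an illustration and makes assumptions: it uses plain grpc-java generated stubs (the StreamServiceGrpc/DataChunk/StatusResponse names depend on your proto package and codegen options) rather than linkerd’s own gRPC implementation, and it points at the outgoing linkerd on port 4144 from the configs below.

import com.google.protobuf.ByteString
import io.grpc.ManagedChannelBuilder
import io.grpc.stub.{ClientCallStreamObserver, StreamObserver}
import java.util.concurrent.CountDownLatch

object StreamRepro {
  def main(args: Array[String]): Unit = {
    val channel = ManagedChannelBuilder.forAddress("127.0.0.1", 4144).usePlaintext().build()
    val stub    = StreamServiceGrpc.newStub(channel)
    val done    = new CountDownLatch(1)

    val responses = new StreamObserver[StatusResponse] {
      def onNext(r: StatusResponse): Unit = println(s"status: ${r.getStatus}")
      def onError(t: Throwable): Unit     = { t.printStackTrace(); done.countDown() }
      def onCompleted(): Unit             = done.countDown()
    }

    // grpc-java hands back a ClientCallStreamObserver here; the cast exposes
    // isReady() for crude flow control so the client doesn't buffer GBs locally.
    val requests = stub.transmit(responses).asInstanceOf[ClientCallStreamObserver[DataChunk]]
    val chunk = DataChunk.newBuilder()
      .setData(ByteString.copyFrom(new Array[Byte](32 * 1024))) // 32KB per message
      .build()

    var sent = 0
    while (sent < 1000000) {                    // ~32GB in total
      while (!requests.isReady) Thread.sleep(1) // wait for the transport to drain
      requests.onNext(chunk)
      sent += 1
    }
    requests.onCompleted()
    done.await()
    channel.shutdown()
  }
}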
Linkerd configurations:
# Outgoing (client)
admin:
  port: 9999
routers:
- protocol: h2
  interpreter:
    kind: default
  dtab: |
    /svc => /$/inet/127.0.0.1/5144;
  servers:
  - port: 4144
    initialStreamWindowBytes: 2147483647
    maxFrameBytes: 16777215
  client:
    initialStreamWindowBytes: 2147483647
    maxFrameBytes: 16777215
# Incoming (server)
admin:
  port: 10009
routers:
- protocol: h2
  interpreter:
    kind: default
  dtab: |
    /svc => /$/inet/127.0.0.1/4321; # Server runs at 4321
  servers:
  - port: 5144
    initialStreamWindowBytes: 2147483647
    maxFrameBytes: 16777215
  client:
    initialStreamWindowBytes: 2147483647
    maxFrameBytes: 16777215
The issue reproduces most consistently if all components (linkerds/client/server) have initialStreamWindowBytes and maxFrameBytes set to their maximum values. However, I don’t believe those are inherently related to the issue; they merely improve throughput, which triggers the bug.
Environment:
Happens at least with linkerd 1.7.x.
Top GitHub Comments
The failure detector is designed to help detect network issues, especially in cases where requests are infrequent. In your situation it sounds like your network is fairly reliable. I think turning it off should be fairly safe for you, especially if your connections don’t have long periods of idleness.
Ping @cpretzer, any comments on the question above? At the moment we are planning to roll out the workaround, but it would be good to know whether it could have any other effects.