Failure detector closes connection if linkerd receives data faster than it can write
Issue Type:
- Bug report
What happened:
We have a setup where we ship data between applications using stream-to-unary gRPC calls. Traffic goes through linkerd, and our application also uses linkerd’s gRPC implementation. The total amount of data can reach tens of GBs; individual messages are less than 64KB. We’ve seen streams getting sporadically cancelled with Reset.Cancel. Retrying the stream once or twice is usually enough to get around the issue.
With debug logging turned on, we can see that the cancellations are caused by the failure detector:
May 29 00:25:43 <hostname> bash[34214]: D 0529 00:25:43.199 EEST THREAD15: failure detector closed HTTP/2 connection: C L:/127.0.0.1:41456 R:127.0.0.1/127.0.0.1:64446
However, there seem to be no issues with the network, and the RTT between servers is sub-millisecond. In fact, as seen in the snippet above, the failure detector closes even localhost connections. There are no GC pauses or similar that would explain this.
While digging into this issue, I often saw that the failure detector ping was added to the event loop queue but executed only a few seconds later or, by the time the failure detector closed the connection, not at all.
The reason for the delay seems to be BufferingChannelTransport. When a frame is written to BufferingChannelTransport, it is added to a queue and a flushing task is scheduled on the event loop. The task keeps flushing until the queue is empty. In situations where data is written to the queue faster than it can be flushed, the flushing task can take arbitrarily long to complete, which prevents any other task queued on the event loop from running. Since the ping countdown starts as soon as the ping task is queued, the failure detector can end up closing a healthy connection that just happens to be too busy.
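To make the mechanism concrete, here is a minimal sketch (not linkerd source): a single-threaded executor stands in for the Netty event loop, flushFrame for the actual socket write, and the drain task keeps going until the queue is empty, so anything queued behind it has to wait.

import java.util.concurrent.{ConcurrentLinkedQueue, Executors}

object FlushStarvationSketch {
  // Single-threaded executor models the event loop shared by writes and pings.
  private val eventLoop = Executors.newSingleThreadExecutor()
  private val pending   = new ConcurrentLinkedQueue[Array[Byte]]()

  // Producers enqueue frames and make sure a flush task is scheduled.
  def write(frame: Array[Byte]): Unit = {
    pending.add(frame)
    eventLoop.execute(() => drainAll())
  }

  // Problematic behaviour: flush until the queue is empty. If producers add
  // frames faster than flushFrame() completes, this task never finishes and
  // every task queued behind it (including the ping) waits.
  private def drainAll(): Unit = {
    var frame = pending.poll()
    while (frame != null) {
      flushFrame(frame)
      frame = pending.poll()
    }
  }

  // The failure-detector ping is just another task on the same event loop,
  // and its deadline starts counting down as soon as it is queued.
  def sendPing(): Unit = eventLoop.execute(() => flushFrame(new Array[Byte](8))) // dummy ping frame

  private def flushFrame(frame: Array[Byte]): Unit = {
    // Stand-in for the real socket write.
  }
}

As long as write() keeps the queue non-empty, drainAll() never returns, and a ping queued behind it never runs even though the connection itself is perfectly healthy.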
A fix/workaround is to limit the number of frames BufferingChannelTransport flushes in one go before scheduling a new task. It’s still possible for the ping task to stay queued for too long (especially if RTT is naturally high), but the likelihood of that should be reduced.
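A sketch of that workaround, under the same assumptions as the snippet above (maxFramesPerFlush is an illustrative constant, not an actual linkerd setting):

import java.util.concurrent.{ConcurrentLinkedQueue, Executors}

object CappedFlushSketch {
  private val eventLoop         = Executors.newSingleThreadExecutor()
  private val pending           = new ConcurrentLinkedQueue[Array[Byte]]()
  private val maxFramesPerFlush = 128 // illustrative value

  def write(frame: Array[Byte]): Unit = {
    pending.add(frame)
    eventLoop.execute(() => drainCapped())
  }

  // Flush at most maxFramesPerFlush frames per task, then re-queue the drain
  // so other tasks on the event loop (e.g. the ping) get a chance to run.
  private def drainCapped(): Unit = {
    var flushed = 0
    var frame   = pending.poll()
    while (frame != null) {
      flushFrame(frame)
      flushed += 1
      if (flushed >= maxFramesPerFlush) {
        if (!pending.isEmpty) eventLoop.execute(() => drainCapped())
        return
      }
      frame = pending.poll()
    }
  }

  private def flushFrame(frame: Array[Byte]): Unit = {
    // Stand-in for the real socket write.
  }
}

Each drain task now finishes after at most maxFramesPerFlush frames, so a queued ping runs between drain tasks instead of waiting for the queue to empty.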
What you expected to happen:
The failure detector should not close healthy connections just because there is too much traffic.
How to reproduce it (as minimally and precisely as possible):
Stream data from client to server via 2 linkerds. Everything can be running on a single machine.
Example service definition:
service StreamService {
  rpc Transmit (stream DataChunk) returns (StatusResponse);
}

message DataChunk {
  bytes data = 1;
}

message StatusResponse {
  enum Status {
    OK = 0;
    ERROR = 1;
  }
  Status status = 1;
}
Data can be whatever. We’ve run with chunk sizes of 32KB and 63KB.
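For reference, here is a minimal sketch of a client that generates this traffic pattern. It is only an illustration and makes assumptions: it uses plain grpc-java generated stubs (the StreamServiceGrpc/DataChunk/StatusResponse names depend on your proto package and codegen options) rather than linkerd’s own gRPC implementation, and it points at the outgoing linkerd on port 4144 from the configs below.

import com.google.protobuf.ByteString
import io.grpc.ManagedChannelBuilder
import io.grpc.stub.{ClientCallStreamObserver, StreamObserver}
import java.util.concurrent.CountDownLatch

object StreamRepro {
  def main(args: Array[String]): Unit = {
    val channel = ManagedChannelBuilder.forAddress("127.0.0.1", 4144).usePlaintext().build()
    val stub    = StreamServiceGrpc.newStub(channel)
    val done    = new CountDownLatch(1)

    val responses = new StreamObserver[StatusResponse] {
      def onNext(r: StatusResponse): Unit = println(s"status: ${r.getStatus}")
      def onError(t: Throwable): Unit     = { t.printStackTrace(); done.countDown() }
      def onCompleted(): Unit             = done.countDown()
    }

    // grpc-java hands back a ClientCallStreamObserver here; the cast exposes
    // isReady() for crude flow control so the client doesn't buffer GBs locally.
    val requests = stub.transmit(responses).asInstanceOf[ClientCallStreamObserver[DataChunk]]
    val chunk = DataChunk.newBuilder()
      .setData(ByteString.copyFrom(new Array[Byte](32 * 1024))) // 32KB per message
      .build()

    var sent = 0
    while (sent < 1000000) {                    // ~32GB in total
      while (!requests.isReady) Thread.sleep(1) // wait for the transport to drain
      requests.onNext(chunk)
      sent += 1
    }
    requests.onCompleted()
    done.await()
    channel.shutdown()
  }
}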
Linkerd configurations:
# Outgoing (client)
admin:
  port: 9999
routers:
- protocol: h2
  interpreter:
    kind: default
  dtab: |
    /svc => /$/inet/127.0.0.1/5144;
  servers:
  - port: 4144
    initialStreamWindowBytes: 2147483647
    maxFrameBytes: 16777215
  client:
    initialStreamWindowBytes: 2147483647
    maxFrameBytes: 16777215
# Incoming (server)
admin:
  port: 10009
routers:
- protocol: h2
  interpreter:
    kind: default
  dtab: |
    /svc => /$/inet/127.0.0.1/4321; # Server runs at 4321
  servers:
  - port: 5144
    initialStreamWindowBytes: 2147483647
    maxFrameBytes: 16777215
  client:
    initialStreamWindowBytes: 2147483647
    maxFrameBytes: 16777215
The issue reproduces most consistently if all components (linkerds/client/server) have initialStreamWindowBytes and maxFrameBytes set to their maximum values. However, I don’t believe those are inherently related to the issue; they merely improve throughput, which triggers the bug.
Environment:
Happens at least with linkerd 1.7.x.
Top GitHub Comments
The failure detector is designed to help detect network issues, especially in cases where requests are infrequent. In your situation it sounds like your network is fairly reliable. I think turning it off should be fairly safe for you, especially if your connections don’t have long periods of idleness.
Ping @cpretzer, any comments on the question above? At the moment we are planning to roll out the workaround, but it would be good to know whether it could have any other effects.