Add KeepAlive support

With our use of gRPC Java across Google Compute Engine (GCE) L3 Load Balancers (Network Load Balancers), we seem to be hitting issues similar to those we had with gRPC in Go: https://github.com/grpc/grpc-go/issues/536

Basically Google L3 load balancers silently drop long-lasting TCP connections after 600 seconds.

While we were able to work around the issue by specifying a custom Dialer in Go:

func WithKeepAliveDialer() grpc.DialOption {
    return grpc.WithDialer(func(addr string, timeout time.Duration) (net.Conn, error) {
        d := net.Dialer{Timeout: timeout, KeepAlive: *flagGrpcClientKeepAliveDuration}
        return d.Dial("tcp", addr)
    })
}

There seems to be no way of overriding the KeepAlive periods for NettyClientTransport. We know it’s possible to set the keepalive period in the kernel of the machines, but it’s a bit of a stretch to expect user-code programmers to know about that.

Can we either:

  • have the ability to specify the TCP keepalive period when creating a channel
  • add documentation about it, especially how it can cause hard-to-debug problems on GCE?

cc @ejona86 since he seems to have had opinions about it in https://github.com/grpc/grpc-java/issues/737

Issue Analytics

  • State: closed
  • Created 7 years ago
  • Reactions: 6
  • Comments:21 (8 by maintainers)

Top GitHub Comments

3 reactions
ejona86 commented, Apr 8, 2016

Here is an excerpt from the document I’m trying to get agreement on:

TCP keepalive is hard to configure in Java and Go. Enabling is easy, but one hour is far too infrequent to be useful; an application-level keepalive seems beneficial for configuration.

TCP keepalive is active even if there are no open streams. This wastes a substantial amount of battery on mobile; an application-level keepalive seems beneficial for optimization.

Application-level keepalive implies HTTP/2 PING. If we take a page from TCP keepalive’s book, there are three parameters to tune: time (time since last receipt before sending a keepalive), interval (interval between keepalives when not receiving a reply), and retry (number of times to retry sending keepalives). Interval and retry don’t quite apply to PING because the transport is reliable, so they will be replaced with timeout (equivalent to interval * retry): how long to wait after sending a PING without receiving any bytes before declaring the connection dead.
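As a rough illustration of the parameter mapping described above, here is a minimal Go sketch. The type and function names (`tcpKeepalive`, `pingKeepalive`, `fromTCP`) are hypothetical and do not come from any gRPC API; the sketch only shows how interval and retry collapse into a single timeout.

```go
package main

import "time"

// tcpKeepalive holds the three TCP-style knobs named in the text.
// (Hypothetical type, for illustration only.)
type tcpKeepalive struct {
	Time     time.Duration // idle time before the first keepalive
	Interval time.Duration // gap between unanswered keepalives
	Retry    int           // number of unanswered keepalives tolerated
}

// pingKeepalive holds the two knobs proposed for HTTP/2 PING keepalive.
// (Hypothetical type, for illustration only.)
type pingKeepalive struct {
	Time    time.Duration // idle time before sending a PING
	Timeout time.Duration // wait after a PING before declaring the connection dead
}

// fromTCP maps TCP-style settings onto PING-style settings:
// time carries over unchanged, and timeout = interval * retry.
func fromTCP(k tcpKeepalive) pingKeepalive {
	return pingKeepalive{
		Time:    k.Time,
		Timeout: k.Interval * time.Duration(k.Retry),
	}
}
```

For example, an interval of 5 seconds with 4 retries maps to a 20-second timeout.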

Doing some form of keepalive is relatively straightforward. But avoiding DDoS is not as easy. Thus, avoiding DDoS is the most important part of the design. To mitigate DDoS the design:

  • Disables keepalive for HTTP/2 connections with no outstanding streams, and
  • Enforces a lower limit to the keepalive delay, namely no less than one minute
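The two mitigation rules above can be sketched in Go as follows. This is only an illustration under stated assumptions: the names `minKeepaliveTime`, `clampKeepaliveTime`, and `keepaliveDue` are hypothetical, and measuring idleness as "time since the last read" is taken from the later paragraphs of this excerpt rather than from any real gRPC implementation.

```go
package main

import "time"

// minKeepaliveTime is the lower limit from the design text: one minute.
const minKeepaliveTime = time.Minute

// clampKeepaliveTime raises a requested keepalive time to the lower limit,
// so an application cannot configure more aggressive pinging than allowed.
func clampKeepaliveTime(requested time.Duration) time.Duration {
	if requested < minKeepaliveTime {
		return minKeepaliveTime
	}
	return requested
}

// keepaliveDue reports whether a keepalive PING should be sent now.
// It returns false whenever the connection has no outstanding streams.
func keepaliveDue(openStreams int, sinceLastRead, keepaliveTime time.Duration) bool {
	if openStreams == 0 {
		return false // keepalive is disabled on idle connections
	}
	return sinceLastRead >= clampKeepaliveTime(keepaliveTime)
}
```

With these rules, a requested keepalive time of 10 seconds behaves as one minute, and a connection with no streams never pings regardless of how long it has been silent.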

Most RPCs are unary with quick replies, so keepalive is less likely to be triggered. It would primarily be triggered when there is a long-lived RPC.

Since keepalive does not occur on HTTP/2 connections without any streams, there will be a higher chance of failure for new RPCs following a long period of inactivity. To reduce the tail latency for these RPCs, it is important not to reset the ‘keepalive time’ when a connection becomes active; if a new stream is created and more than ‘keepalive time’ has elapsed since the last read byte, then a keepalive PING should be sent (ideally before the HEADERS frame). Doing so detects the broken connection with a latency of ‘keepalive timeout’ instead of ‘keepalive time + timeout’.
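The decision on stream creation, and the resulting difference in detection latency, can be sketched in Go as below. The function names are hypothetical and the sketch assumes the implementation tracks the time since the last byte was read; it is not taken from any gRPC codebase.

```go
package main

import "time"

// pingBeforeHeaders reports whether a keepalive PING should be sent when a
// new stream is created, ideally ahead of the HEADERS frame. The idle clock
// is not reset by the new stream, so a long-idle connection is probed at once.
func pingBeforeHeaders(sinceLastRead, keepaliveTime time.Duration) bool {
	return sinceLastRead > keepaliveTime
}

// detectionLatency returns how long a broken connection can go undetected
// after a new stream is created on a long-idle connection.
func detectionLatency(pingOnNewStream bool, keepaliveTime, keepaliveTimeout time.Duration) time.Duration {
	if pingOnNewStream {
		// The PING goes out immediately, so only the timeout applies.
		return keepaliveTimeout
	}
	// The idle clock restarts, so a full keepalive time passes first.
	return keepaliveTime + keepaliveTimeout
}
```

With a keepalive time of 60 seconds and a timeout of 20 seconds, pinging on stream creation brings detection down from 80 seconds to 20 seconds.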

‘Keepalive time’ is ideally measured from the time of the last byte read. However, simplistic implementations may choose to measure from the time of the last keepalive PING (aka, polling). Such implementations should take extra precautions to avoid issues due to latency added by outbound buffers, such as limiting the outbound buffer size and using a larger ‘keepalive timeout’.

As an optional optimization, when ‘keepalive timeout’ is exceeded, don’t kill the connection. Instead, start a new connection. If the new connection becomes ready and the old connection still hasn’t received any bytes, then kill the old connection. If the old connection wins the race, then kill the new connection mid-startup.
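The race described above could look roughly like this in Go. It is a sketch only: the `winner` type, its constants, and `raceConnections` are hypothetical names, and the two channels stand in for whatever events a real transport would use to signal "old connection received bytes" and "new connection is ready".

```go
package main

// winner says which connection survives the race.
type winner int

const (
	keepOld winner = iota // old connection received bytes first: kill the new one mid-startup
	useNew                // new connection became ready first: kill the old one
)

// raceConnections blocks until either the old connection receives bytes or
// the new connection becomes ready, and reports which one should be kept.
// If both events have already fired, select picks one of them at random.
func raceConnections(oldReceivedBytes, newReady <-chan struct{}) winner {
	select {
	case <-oldReceivedBytes:
		return keepOld
	case <-newReady:
		return useNew
	}
}
```

A caller would start dialing the replacement when the timeout fires, then close whichever connection loses.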

The ‘keepalive time’ is expected to be an application-configurable option, with at least second precision. It is unspecified whether ‘keepalive timeout’ is application-configurable, but it should be at least several times the round-trip time to allow for lost packets and TCP retransmits. It may also need to be higher to account for long garbage collector pauses.

1 reaction
jhump commented, Jan 30, 2018

@ZedYu, true, but not all clients will necessarily ping. That strategy would work only in a perfectly homogeneous environment where the server knew a priori how the client’s ping interval is configured.
