Keepalive server side not working
Problem description
The gRPC server does not take keepalive options into account, so it is not possible to detect a client disconnection during server streaming.
Reproduction steps
Create a server with keepalive options (several combinations tested)
const grpcOptions = {
'grpc.keepalive_time_ms': 5000,
'grpc.keepalive_timeout_ms': 5000,
'grpc.grpc.max_connection_idle_ms': 5000,
'grpc.keepalive_permit_without_calls': 1,
'grpc.http2.max_pings_without_data': 2000000,
'grpc.http2.max_ping_strikes': 1,
// 'grpc.http2.min_sent_ping_interval_without_data_ms': 5000,
// 'grpc.http2.min_time_between_pings_ms': 10000,
// 'grpc.http2.min_ping_interval_without_data_ms': 5000
};
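Note that the third key in the snippet above has a doubled `grpc.` prefix (`grpc.grpc.max_connection_idle_ms`), which the C core would silently ignore. A minimal corrected sketch follows; the server-construction lines are an assumption based on the standard gRPC Node API, not taken from the issue:

```javascript
// Corrected channel options: 'grpc.max_connection_idle_ms' has no doubled
// "grpc.grpc." prefix, unlike the snippet in the report.
const grpcOptions = {
  'grpc.keepalive_time_ms': 5000,
  'grpc.keepalive_timeout_ms': 5000,
  'grpc.max_connection_idle_ms': 5000,
  'grpc.keepalive_permit_without_calls': 1,
  'grpc.http2.max_pings_without_data': 2000000,
  'grpc.http2.max_ping_strikes': 1,
};

// The options object is passed to the server constructor, e.g.:
//   const grpc = require('grpc'); // or require('@grpc/grpc-js')
//   const server = new grpc.Server(grpcOptions);
```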
Have a client with no keepalive options call a bidirectional streaming method.
After a while, no information circulates on the gRPC connection, but the stream remains open (the client sends data at least once a day). Physically disconnect the client from the server.
The server should detect the disconnection thanks to keepalive messages. But despite the keepalive options, the logs show that the keepalive pings are not actually sent, so nothing is detected.
Environment
- OS name, version and architecture: Linux Ubuntu 16.04 amd64
- Node version: 14.8.0
- Node installation method: curl -sL https://deb.nodesource.com/setup_14.x | sudo -E bash - && sudo apt-get install -y nodejs
- If applicable, compiler version: N/A
- Package name and version: grpc@1.24.3
Additional context
I0818 10:22:59.721528027 18301 parsing.cc:430] HTTP:3:HDR:CLI: x-envoy-peer-metadata-id: 73 69 64 65 63 61 72 7e 31 30 2e 36 30 2e 31 2e 31 32 7e 70 72 6f 64 75 63 74 2d 31 2d 31 2d 30 2d 62 2d 32 2d 73 6e 61 70 73 68 6f 74 2d 36 63 63 39 66 64 36 35 39 2d 7a 76 39 71 36 2e 64 65 66 61 75 6c 74 7e 64 65 66 61 75 6c 74 2e 73 76 63 2e 63 6c 75 73 74 65 72 2e 6c 6f 63 61 6c 'sidecar~10.60.1.12~product-1-1-0-b-2-snapshot-6cc9fd659-zv9q6.default~default.svc.cluster.local'
I0818 10:22:59.721534878 18301 parsing.cc:430] HTTP:3:HDR:CLI: date: 54 75 65 2c 20 31 38 20 41 75 67 20 32 30 32 30 20 30 38 3a 32 32 3a 35 39 20 47 4d 54 'Tue, 18 Aug 2020 08:22:59 GMT'
I0818 10:22:59.721539936 18301 parsing.cc:430] HTTP:3:HDR:CLI: server: 69 73 74 69 6f 2d 65 6e 76 6f 79 'istio-envoy'
I0818 10:22:59.721546550 18301 parsing.cc:430] HTTP:3:HDR:CLI: x-envoy-decorator-operation: 70 72 6f 64 75 63 74 2d 31 2d 31 2d 30 2d 62 2d 32 2d 73 6e 61 70 73 68 6f 74 2e 64 65 66 61 75 6c 74 2e 73 76 63 2e 63 6c 75 73 74 65 72 2e 6c 6f 63 61 6c 3a 35 30 30 35 35 2f 2a 'product-1-1-0-b-2-snapshot.default.svc.cluster.local:50055/*'
I0818 10:22:59.721555060 18301 parsing.cc:686] parsing trailing_metadata
I0818 10:22:59.721559189 18301 parsing.cc:541] HTTP:3:TRL:CLI: grpc-status: 30 '0'
I0818 10:22:59.721563049 18301 parsing.cc:541] HTTP:3:TRL:CLI: grpc-message: 4f 4b 'OK'
I0818 10:23:04.603784803 18301 chttp2_transport.cc:839] W:0x3cce920 SERVER [ipv4:127.0.0.1:35288] state IDLE -> WRITING [KEEPALIVE_PING]
I0818 10:23:04.603850403 18301 writing.cc:89] SERVER: Ping delayed [0x3e12430]: not enough time elapsed since last ping. Last ping 32146.000000: Next ping 332146.000000: Now 37166.000000
I0818 10:23:04.603869242 18301 chttp2_transport.cc:839] W:0x3cce920 SERVER [ipv4:127.0.0.1:35288] state WRITING -> IDLE [begin writing nothing]
I0818 10:27:59.583702051 18301 chttp2_transport.cc:839] W:0x3cce920 SERVER [ipv4:127.0.0.1:35288] state IDLE -> WRITING [RETRY_SEND_PING]
I0818 10:27:59.583750142 18301 writing.cc:116] SERVER: Ping sent [ipv4:127.0.0.1:35288]: 1999999/2000000
I0818 10:27:59.583765218 18301 chttp2_transport.cc:839] W:0x3cce920 SERVER [ipv4:127.0.0.1:35288] state WRITING -> WRITING [begin write in current thread]
I0818 10:27:59.583841187 18301 chttp2_transport.cc:2660] ipv4:127.0.0.1:35288: Start BDP ping err="No Error"
I0818 10:27:59.583861770 18301 chttp2_transport.cc:2808] ipv4:127.0.0.1:35288: Start keepalive ping
I0818 10:27:59.583881177 18301 chttp2_transport.cc:839] W:0x3cce920 SERVER [ipv4:127.0.0.1:35288] state WRITING -> IDLE [finish writing]
I0818 10:27:59.602496376 18301 chttp2_transport.cc:839] W:0x3cce920 SERVER [ipv4:127.0.0.1:35288] state IDLE -> WRITING [PING_RESPONSE]
I0818 10:27:59.602544504 18301 chttp2_transport.cc:2676] ipv4:127.0.0.1:35288: Complete BDP ping err="No Error"
I0818 10:27:59.602569380 18301 chttp2_transport.cc:2821] ipv4:127.0.0.1:35288: Finish keepalive ping
I0818 10:27:59.602589152 18301 chttp2_transport.cc:839] W:0x3cce920 SERVER [ipv4:127.0.0.1:35288] state WRITING -> WRITING [begin write in current thread]
I0818 10:27:59.602662284 18301 chttp2_transport.cc:839] W:0x3cce920 SERVER [ipv4:127.0.0.1:35288] state WRITING -> IDLE [finish writing]
I0818 10:28:04.602829015 18301 chttp2_transport.cc:839] W:0x3cce920 SERVER [ipv4:127.0.0.1:35288] state IDLE -> WRITING [KEEPALIVE_PING]
I0818 10:28:04.602896452 18301 writing.cc:89] SERVER: Ping delayed [0x3e12430]: not enough time elapsed since last ping. Last ping 332146.000000: Next ping 632146.000000: Now 337165.000000
I0818 10:28:04.602917480 18301 chttp2_transport.cc:839] W:0x3cce920 SERVER [ipv4:127.0.0.1:35288] state WRITING -> IDLE [begin writing nothing]
Issue Analytics
- Created: 3 years ago
- Comments: 11 (9 by maintainers)
Top GitHub Comments
I am not sure if that helped, or whether my bug was related to gRPC or to the server-side application.
For some months after that, the stream worked pretty well. But a few weeks ago we encountered it again, and this time we got
UNAVAILABLE: keepalive watchdog timeout
exceptions, the server was not reachable, and the application threw exceptions. I am not sure whether this was related to the stream settings I set. Unfortunately we have no influence over the server app, and the client and server keepalive settings need to match. The default server keepalive settings are pretty high (2 hours): [keepalive defaults](https://github.com/grpc/grpc/blob/master/doc/keepalive.md#defaults-values). And if you set lower values in your client, you get a GOAWAY from the server. But enabling keepalive in the client is never a bad idea:
As stated in my comments, `grpc.keepalive_time_ms` enables the keepalive, but it is effectively limited by `min_sent_ping_interval_without_data_ms` / `min_time_between_pings_ms` (5 min by default). It does produce debug log messages like `not sending keepalive, as min_sent_ping_interval_without_data_ms not reached`, which was helpful to see that our client application is still doing something.
In our case we receive data on the stream every few seconds, so the keepalive, which is only sent when there has been no recent activity, should fire quickly and every x seconds to discover connection problems fast.
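For reference, a conservative client-side configuration along these lines might look as follows. The values are illustrative, not taken from the issue; pinging more often than the server's enforced minimum interval provokes ping strikes and eventually the GOAWAY mentioned above:

```javascript
// Client-side keepalive channel options. Server defaults only permit one
// ping every 5 minutes without data, so more aggressive values risk a
// GOAWAY ("too_many_pings") unless the server is reconfigured.
const clientChannelOptions = {
  'grpc.keepalive_time_ms': 300000,         // ping after 5 min of inactivity
  'grpc.keepalive_timeout_ms': 20000,       // connection is dead after 20 s without an ack
  'grpc.keepalive_permit_without_calls': 1, // ping even with no call in flight
};

// Hypothetical usage with a generated client class:
//   const client = new MyServiceClient(address, credentials, clientChannelOptions);
```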
@murgatroid99 Do you know the reasons why the server default values are so high? Or am I misunderstanding these values? I recently read about an idleTimeout to auto-close the stream. Possibly this could help us; how can we set it via JS?
If the server ends your stream, you should start a new one if you want to continue to have an active stream.
I'm not saying that you should do anything on the server. I'm explaining what the client behavior should be, which depends on what the server does. If the server continues sending messages, the client doesn't need to do anything special; it will just keep getting messages. Otherwise, if the server ends the stream with a status (which you will see in the `status` event), the client will need to create a new stream to continue getting messages.
If the server both stops sending messages and does not end the stream with a status, that is misbehavior on the part of the server that the client cannot handle.
The point of that is to ensure that all incoming messages have been processed before emitting the status. Lower layers use the status as a signal that all incoming messages have been received, so there should not be any code path that receives a status but never sets `read_status`.