
Keepalive server side not working

See original GitHub issue

Problem description

The gRPC server does not take keepalive options into account. It is not possible to detect a client disconnection during server streaming.

Reproduction steps

Create a server with keepalive options (several combinations tested)

        const grpcOptions = {
            'grpc.keepalive_time_ms': 5000,
            'grpc.keepalive_timeout_ms': 5000,
            'grpc.max_connection_idle_ms': 5000,
            'grpc.keepalive_permit_without_calls': 1,
            'grpc.http2.max_pings_without_data': 2000000,
            'grpc.http2.max_ping_strikes': 1,
            // 'grpc.http2.min_sent_ping_interval_without_data_ms': 5000,
            // 'grpc.http2.min_time_between_pings_ms': 10000,
            // 'grpc.http2.min_ping_interval_without_data_ms': 5000
        };

Have a client with no keepalive option call a bidir streaming method.

After a while, no information circulates on the gRPC connection, but the stream remains open. The client sends data at least once a day. Physically disconnect the client from the server.

The server should detect the disconnection thanks to keepalive messages. But despite the keepalive options, the logs show that the keepalive messages are not sent, so nothing is detected.
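For orientation (an assumption based on the documented semantics of these two channel arguments, not on observed behavior): with the options above, a silent disconnect should be detected within roughly keepalive_time_ms + keepalive_timeout_ms of the last acknowledged ping. A trivial sketch of that expectation (the helper name is mine):

```javascript
// Worst-case detection delay implied by the two keepalive options: a ping is
// sent after keepalive_time_ms of inactivity, and the transport is closed if
// no ack arrives within keepalive_timeout_ms after that.
function expectedDetectionDelayMs(options) {
  return options['grpc.keepalive_time_ms'] + options['grpc.keepalive_timeout_ms'];
}

// With the options above: 5000 + 5000 = 10000 ms, i.e. detection within ~10 s,
// which is what the reporter expects but does not observe
expectedDetectionDelayMs({
  'grpc.keepalive_time_ms': 5000,
  'grpc.keepalive_timeout_ms': 5000,
}); // → 10000
```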

Environment

  • OS name, version and architecture: Linux Ubuntu 16.04 amd64
  • Node version: 14.8.0
  • Node installation method: curl -sL https://deb.nodesource.com/setup_14.x | sudo -E bash - && sudo apt-get install -y nodejs
  • If applicable, compiler version: N/A
  • Package name and version: grpc@1.24.3

Additional context

I0818 10:22:59.721528027 18301 parsing.cc:430] HTTP:3:HDR:CLI: x-envoy-peer-metadata-id: 73 69 64 65 63 61 72 7e 31 30 2e 36 30 2e 31 2e 31 32 7e 70 72 6f 64 75 63 74 2d 31 2d 31 2d 30 2d 62 2d 32 2d 73 6e 61 70 73 68 6f 74 2d 36 63 63 39 66 64 36 35 39 2d 7a 76 39 71 36 2e 64 65 66 61 75 6c 74 7e 64 65 66 61 75 6c 74 2e 73 76 63 2e 63 6c 75 73 74 65 72 2e 6c 6f 63 61 6c 'sidecar~10.60.1.12~product-1-1-0-b-2-snapshot-6cc9fd659-zv9q6.default~default.svc.cluster.local'
I0818 10:22:59.721534878 18301 parsing.cc:430] HTTP:3:HDR:CLI: date: 54 75 65 2c 20 31 38 20 41 75 67 20 32 30 32 30 20 30 38 3a 32 32 3a 35 39 20 47 4d 54 'Tue, 18 Aug 2020 08:22:59 GMT'
I0818 10:22:59.721539936 18301 parsing.cc:430] HTTP:3:HDR:CLI: server: 69 73 74 69 6f 2d 65 6e 76 6f 79 'istio-envoy'
I0818 10:22:59.721546550 18301 parsing.cc:430] HTTP:3:HDR:CLI: x-envoy-decorator-operation: 70 72 6f 64 75 63 74 2d 31 2d 31 2d 30 2d 62 2d 32 2d 73 6e 61 70 73 68 6f 74 2e 64 65 66 61 75 6c 74 2e 73 76 63 2e 63 6c 75 73 74 65 72 2e 6c 6f 63 61 6c 3a 35 30 30 35 35 2f 2a 'product-1-1-0-b-2-snapshot.default.svc.cluster.local:50055/*'
I0818 10:22:59.721555060 18301 parsing.cc:686] parsing trailing_metadata
I0818 10:22:59.721559189 18301 parsing.cc:541] HTTP:3:TRL:CLI: grpc-status: 30 '0'
I0818 10:22:59.721563049 18301 parsing.cc:541] HTTP:3:TRL:CLI: grpc-message: 4f 4b 'OK'
I0818 10:23:04.603784803 18301 chttp2_transport.cc:839] W:0x3cce920 SERVER [ipv4:127.0.0.1:35288] state IDLE -> WRITING [KEEPALIVE_PING]
I0818 10:23:04.603850403 18301 writing.cc:89] SERVER: Ping delayed [0x3e12430]: not enough time elapsed since last ping. Last ping 32146.000000: Next ping 332146.000000: Now 37166.000000
I0818 10:23:04.603869242 18301 chttp2_transport.cc:839] W:0x3cce920 SERVER [ipv4:127.0.0.1:35288] state WRITING -> IDLE [begin writing nothing]
I0818 10:27:59.583702051 18301 chttp2_transport.cc:839] W:0x3cce920 SERVER [ipv4:127.0.0.1:35288] state IDLE -> WRITING [RETRY_SEND_PING]
I0818 10:27:59.583750142 18301 writing.cc:116] SERVER: Ping sent [ipv4:127.0.0.1:35288]: 1999999/2000000
I0818 10:27:59.583765218 18301 chttp2_transport.cc:839] W:0x3cce920 SERVER [ipv4:127.0.0.1:35288] state WRITING -> WRITING [begin write in current thread]
I0818 10:27:59.583841187 18301 chttp2_transport.cc:2660] ipv4:127.0.0.1:35288: Start BDP ping err="No Error"
I0818 10:27:59.583861770 18301 chttp2_transport.cc:2808] ipv4:127.0.0.1:35288: Start keepalive ping
I0818 10:27:59.583881177 18301 chttp2_transport.cc:839] W:0x3cce920 SERVER [ipv4:127.0.0.1:35288] state WRITING -> IDLE [finish writing]
I0818 10:27:59.602496376 18301 chttp2_transport.cc:839] W:0x3cce920 SERVER [ipv4:127.0.0.1:35288] state IDLE -> WRITING [PING_RESPONSE]
I0818 10:27:59.602544504 18301 chttp2_transport.cc:2676] ipv4:127.0.0.1:35288: Complete BDP ping err="No Error"
I0818 10:27:59.602569380 18301 chttp2_transport.cc:2821] ipv4:127.0.0.1:35288: Finish keepalive ping
I0818 10:27:59.602589152 18301 chttp2_transport.cc:839] W:0x3cce920 SERVER [ipv4:127.0.0.1:35288] state WRITING -> WRITING [begin write in current thread]
I0818 10:27:59.602662284 18301 chttp2_transport.cc:839] W:0x3cce920 SERVER [ipv4:127.0.0.1:35288] state WRITING -> IDLE [finish writing]
I0818 10:28:04.602829015 18301 chttp2_transport.cc:839] W:0x3cce920 SERVER [ipv4:127.0.0.1:35288] state IDLE -> WRITING [KEEPALIVE_PING]
I0818 10:28:04.602896452 18301 writing.cc:89] SERVER: Ping delayed [0x3e12430]: not enough time elapsed since last ping. Last ping 332146.000000: Next ping 632146.000000: Now 337165.000000
I0818 10:28:04.602917480 18301 chttp2_transport.cc:839] W:0x3cce920 SERVER [ipv4:127.0.0.1:35288] state WRITING -> IDLE [begin writing nothing]
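The delayed pings in this trace line up with gRPC core's default minimum interval between sent pings without data (5 minutes): each "Next ping" value is exactly 300000 ms after the corresponding "Last ping", regardless of the 5000 ms keepalive_time_ms. A small sketch of that arithmetic (the helper name is mine, not a gRPC API):

```javascript
// The trace shows "Last ping 32146 ... Next ping 332146" and later
// "Last ping 332146 ... Next ping 632146": pings are throttled to one every
// 300000 ms (5 min), which overrides the much lower keepalive_time_ms of 5000.
function nextAllowedPingMs(lastPingMs, minIntervalWithoutDataMs = 300000) {
  return lastPingMs + minIntervalWithoutDataMs;
}

nextAllowedPingMs(32146);  // → 332146, matching the first delayed ping
nextAllowedPingMs(332146); // → 632146, matching the second
```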

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 11 (9 by maintainers)

Top GitHub Comments

1 reaction
Patrick-Remy commented, Mar 5, 2021

I am not sure if that helped, or whether my bug was related to gRPC or to the server-side application.

For some months after that, the stream worked pretty well. But a few weeks ago we encountered it again; this time we got `UNAVAILABLE: keepalive watchdog timeout` exceptions, the server was not reachable, and the application threw exceptions. I am not sure whether this was related to the streamSettings I set. Unfortunately we have no influence over the server app, and client and server keepalive settings need to match. The default server keepalive settings are pretty high (2 hours): [keepalive defaults](https://github.com/grpc/grpc/blob/master/doc/keepalive.md#defaults-values). And if you set lower values in your client, you get a GOAWAY from the server.

But enabling keepalive in the client is never a bad idea:

const streamOptions = {
    // send pings every X seconds if there is no activity; this is limited by
    // min_sent_ping_interval_without_data_ms/min_time_between_pings_ms, but it prints
    // some logs if `DEBUG=1 GRPC_VERBOSITY=DEBUG GRPC_TRACE=connectivity_state,http_keepalive`
    // is enabled :-)
    'grpc.keepalive_time_ms': 15000

    /**
     * Following values cannot be set as they cause a GOAWAY from server :-(
     */
    // wait timeout for ping ack before considering the connection dead
    // 'grpc.keepalive_timeout_ms': 15000
    // send pings even without active streams
    // 'grpc.keepalive_permit_without_calls': 1
    // always send pings
    // 'grpc.http2.max_pings_without_data': 0,
    // same ping interval without data, as with data
    // 'grpc.http2.min_sent_ping_interval_without_data_ms': 5000,
    // same as above for compatibility reasons, see https://grpc.github.io/grpc/core/group__grpc__arg__keys.html#ga69583c8efdbdcb7cdf9055ee80a07014
    // 'grpc.http2.min_time_between_pings_ms': 30000
    // recreate connection after 2 min? consider whether this is useful for debugging
    // 'grpc.max_connection_age_ms': 2 * 60 * 1000
}
const client = new grpc.Client(address, credentials, streamOptions)

As stated in my comments, grpc.keepalive_time_ms enables the keepalive, but it is effectively clamped to min_sent_ping_interval_without_data_ms/min_time_between_pings_ms (5 min by default). Still, it produces debug log messages like `not sending keepalive, as min_sent_ping_interval_without_data_ms not reached`, which was helpful to see that our client application is still doing something.

In our case we are receiving data every few seconds over the stream, so the keepalive, which is only sent when there has been no recent activity, should be sent quickly, every x seconds, to discover connection problems fast.

@murgatroid99 Do you know the reason why the server default values are so high? Or am I misunderstanding these values? I recently read about an idleTimeout to auto-close the stream. Could this help us, and how can it be set via JS?

0 reactions
murgatroid99 commented, Sep 2, 2021

if a connection gets dropped, currently the client doesn't do any request that would initiate a reconnect

If the server ends your stream, you should start a new one if you want to continue to have an active stream.

Unfortunately I do not have control over the server's implementation.

I’m not saying that you should do anything on the server. I’m explaining what the client behavior should be, which depends on what the server does. If the server continues sending messages, the client doesn’t need to do anything special, it will just keep getting messages. Otherwise, if the server ends the stream with a status (which you will see in the status event), the client will need to create a new stream to continue getting messages.

If the server both stops sending messages and does not end the stream with a status, that is misbehavior on the part of the server that the client cannot handle.

if I look at the source of _emitStatusIfDone, as a question: why is it relevant that this.read_status AND this.received_status are set

The point of that is to ensure that all incoming messages have been processed before emitting the status. Lower layers use the status as a signal that all incoming messages have been received, so there should not be any code path that receives a status but never sets read_status.
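A simplified model of that invariant (not the actual grpc-js source, just the logic described): the status is surfaced only once both conditions hold, in either order, so no incoming message can be dropped behind an early status.

```javascript
// Minimal model of the described invariant: emit the status only after the
// final status was received AND all incoming messages have been processed.
class StatusGate {
  constructor(emitStatus) {
    this.readDone = false; // all incoming messages have been processed
    this.status = null;    // final status received from the lower layer
    this.emitStatus = emitStatus;
  }
  markReadDone() { this.readDone = true; this._maybeEmit(); }
  receiveStatus(status) { this.status = status; this._maybeEmit(); }
  _maybeEmit() {
    if (this.readDone && this.status !== null) this.emitStatus(this.status);
  }
}

// The status arriving first is not enough; it waits for reads to drain:
let emitted = null;
const gate = new StatusGate((s) => { emitted = s; });
gate.receiveStatus('OK'); // emitted is still null here
gate.markReadDone();      // now emitted === 'OK'
```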
