Keepalive server side not working
Problem description
The gRPC server does not take keepalive options into account, so it is not possible to detect a client disconnection during server streaming.
Reproduction steps
Create a server with keepalive options (several combinations tested)
const grpcOptions = {
'grpc.keepalive_time_ms': 5000,
'grpc.keepalive_timeout_ms': 5000,
'grpc.grpc.max_connection_idle_ms': 5000,
'grpc.keepalive_permit_without_calls': 1,
'grpc.http2.max_pings_without_data': 2000000,
'grpc.http2.max_ping_strikes': 1,
// 'grpc.http2.min_sent_ping_interval_without_data_ms': 5000,
// 'grpc.http2.min_time_between_pings_ms': 10000,
// 'grpc.http2.min_ping_interval_without_data_ms': 5000
};
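Note that the third key in the snippet above has a doubled `grpc.` prefix (`grpc.grpc.max_connection_idle_ms`), which the C core would silently ignore. A minimal corrected sketch follows; the server-construction lines are an assumption based on the standard gRPC Node API, not taken from the issue:

```javascript
// Corrected channel options: 'grpc.max_connection_idle_ms' has no doubled
// "grpc.grpc." prefix, unlike the snippet in the report.
const grpcOptions = {
  'grpc.keepalive_time_ms': 5000,
  'grpc.keepalive_timeout_ms': 5000,
  'grpc.max_connection_idle_ms': 5000,
  'grpc.keepalive_permit_without_calls': 1,
  'grpc.http2.max_pings_without_data': 2000000,
  'grpc.http2.max_ping_strikes': 1,
};

// The options object is passed to the server constructor, e.g.:
//   const grpc = require('grpc'); // or require('@grpc/grpc-js')
//   const server = new grpc.Server(grpcOptions);
```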
Have a client with no keepalive options call a bidirectional streaming method.
After a while, no information circulates on the gRPC connection, but the stream remains open (the client sends data at least once a day). Physically disconnect the client from the server.
The server should detect the disconnection thanks to keepalive messages. But despite the keepalive options, the logs show that the keepalive pings are not actually sent, so nothing is detected.
Environment
- OS name, version and architecture: Linux Ubuntu 16.04 amd64
- Node version: 14.8.0
- Node installation method: curl -sL https://deb.nodesource.com/setup_14.x | sudo -E bash - && sudo apt-get install -y nodejs
- If applicable, compiler version: N/A
- Package name and version: grpc@1.24.3
Additional context
I0818 10:22:59.721528027 18301 parsing.cc:430] HTTP:3:HDR:CLI: x-envoy-peer-metadata-id: 73 69 64 65 63 61 72 7e 31 30 2e 36 30 2e 31 2e 31 32 7e 70 72 6f 64 75 63 74 2d 31 2d 31 2d 30 2d 62 2d 32 2d 73 6e 61 70 73 68 6f 74 2d 36 63 63 39 66 64 36 35 39 2d 7a 76 39 71 36 2e 64 65 66 61 75 6c 74 7e 64 65 66 61 75 6c 74 2e 73 76 63 2e 63 6c 75 73 74 65 72 2e 6c 6f 63 61 6c 'sidecar~10.60.1.12~product-1-1-0-b-2-snapshot-6cc9fd659-zv9q6.default~default.svc.cluster.local'
I0818 10:22:59.721534878 18301 parsing.cc:430] HTTP:3:HDR:CLI: date: 54 75 65 2c 20 31 38 20 41 75 67 20 32 30 32 30 20 30 38 3a 32 32 3a 35 39 20 47 4d 54 'Tue, 18 Aug 2020 08:22:59 GMT'
I0818 10:22:59.721539936 18301 parsing.cc:430] HTTP:3:HDR:CLI: server: 69 73 74 69 6f 2d 65 6e 76 6f 79 'istio-envoy'
I0818 10:22:59.721546550 18301 parsing.cc:430] HTTP:3:HDR:CLI: x-envoy-decorator-operation: 70 72 6f 64 75 63 74 2d 31 2d 31 2d 30 2d 62 2d 32 2d 73 6e 61 70 73 68 6f 74 2e 64 65 66 61 75 6c 74 2e 73 76 63 2e 63 6c 75 73 74 65 72 2e 6c 6f 63 61 6c 3a 35 30 30 35 35 2f 2a 'product-1-1-0-b-2-snapshot.default.svc.cluster.local:50055/*'
I0818 10:22:59.721555060 18301 parsing.cc:686] parsing trailing_metadata
I0818 10:22:59.721559189 18301 parsing.cc:541] HTTP:3:TRL:CLI: grpc-status: 30 '0'
I0818 10:22:59.721563049 18301 parsing.cc:541] HTTP:3:TRL:CLI: grpc-message: 4f 4b 'OK'
I0818 10:23:04.603784803 18301 chttp2_transport.cc:839] W:0x3cce920 SERVER [ipv4:127.0.0.1:35288] state IDLE -> WRITING [KEEPALIVE_PING]
I0818 10:23:04.603850403 18301 writing.cc:89] SERVER: Ping delayed [0x3e12430]: not enough time elapsed since last ping. Last ping 32146.000000: Next ping 332146.000000: Now 37166.000000
I0818 10:23:04.603869242 18301 chttp2_transport.cc:839] W:0x3cce920 SERVER [ipv4:127.0.0.1:35288] state WRITING -> IDLE [begin writing nothing]
I0818 10:27:59.583702051 18301 chttp2_transport.cc:839] W:0x3cce920 SERVER [ipv4:127.0.0.1:35288] state IDLE -> WRITING [RETRY_SEND_PING]
I0818 10:27:59.583750142 18301 writing.cc:116] SERVER: Ping sent [ipv4:127.0.0.1:35288]: 1999999/2000000
I0818 10:27:59.583765218 18301 chttp2_transport.cc:839] W:0x3cce920 SERVER [ipv4:127.0.0.1:35288] state WRITING -> WRITING [begin write in current thread]
I0818 10:27:59.583841187 18301 chttp2_transport.cc:2660] ipv4:127.0.0.1:35288: Start BDP ping err="No Error"
I0818 10:27:59.583861770 18301 chttp2_transport.cc:2808] ipv4:127.0.0.1:35288: Start keepalive ping
I0818 10:27:59.583881177 18301 chttp2_transport.cc:839] W:0x3cce920 SERVER [ipv4:127.0.0.1:35288] state WRITING -> IDLE [finish writing]
I0818 10:27:59.602496376 18301 chttp2_transport.cc:839] W:0x3cce920 SERVER [ipv4:127.0.0.1:35288] state IDLE -> WRITING [PING_RESPONSE]
I0818 10:27:59.602544504 18301 chttp2_transport.cc:2676] ipv4:127.0.0.1:35288: Complete BDP ping err="No Error"
I0818 10:27:59.602569380 18301 chttp2_transport.cc:2821] ipv4:127.0.0.1:35288: Finish keepalive ping
I0818 10:27:59.602589152 18301 chttp2_transport.cc:839] W:0x3cce920 SERVER [ipv4:127.0.0.1:35288] state WRITING -> WRITING [begin write in current thread]
I0818 10:27:59.602662284 18301 chttp2_transport.cc:839] W:0x3cce920 SERVER [ipv4:127.0.0.1:35288] state WRITING -> IDLE [finish writing]
I0818 10:28:04.602829015 18301 chttp2_transport.cc:839] W:0x3cce920 SERVER [ipv4:127.0.0.1:35288] state IDLE -> WRITING [KEEPALIVE_PING]
I0818 10:28:04.602896452 18301 writing.cc:89] SERVER: Ping delayed [0x3e12430]: not enough time elapsed since last ping. Last ping 332146.000000: Next ping 632146.000000: Now 337165.000000
I0818 10:28:04.602917480 18301 chttp2_transport.cc:839] W:0x3cce920 SERVER [ipv4:127.0.0.1:35288] state WRITING -> IDLE [begin writing nothing]
Issue Analytics
- Created: 3 years ago
- Comments: 11 (9 by maintainers)
Top GitHub Comments
I am not sure if that helped, or whether my bug was related to gRPC or to the server-side application.
For some months after that, the stream worked pretty well. But a few weeks ago we encountered it again, and this time we got
UNAVAILABLE: keepalive watchdog timeout
exceptions, the server was not reachable, and the application threw exceptions. I am not sure whether this was related to the stream settings I set. Unfortunately we have no influence over the server app, and the client and server keepalive settings need to match. The default server keepalive settings are pretty high (2 hours): [keepalive defaults](https://github.com/grpc/grpc/blob/master/doc/keepalive.md#defaults-values). And if you set lower values in your client, you get a GOAWAY from the server. But enabling keepalive in the client is never a bad idea:
As stated in my comments, `grpc.keepalive_time_ms` enables the keepalive, but it is effectively limited by `min_sent_ping_interval_without_data_ms` / `min_time_between_pings_ms` (5 min by default). It does produce debug log messages like `not sending keepalive, as min_sent_ping_interval_without_data_ms not reached`, which was helpful to see that our client application is still doing something.
In our case we receive data on the stream every few seconds, so the keepalive, which is only sent when there has been no recent activity, should fire quickly and every x seconds to discover connection problems fast.
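For reference, a conservative client-side configuration along these lines might look as follows. The values are illustrative, not taken from the issue; pinging more often than the server's enforced minimum interval provokes ping strikes and eventually the GOAWAY mentioned above:

```javascript
// Client-side keepalive channel options. Server defaults only permit one
// ping every 5 minutes without data, so more aggressive values risk a
// GOAWAY ("too_many_pings") unless the server is reconfigured.
const clientChannelOptions = {
  'grpc.keepalive_time_ms': 300000,         // ping after 5 min of inactivity
  'grpc.keepalive_timeout_ms': 20000,       // connection is dead after 20 s without an ack
  'grpc.keepalive_permit_without_calls': 1, // ping even with no call in flight
};

// Hypothetical usage with a generated client class:
//   const client = new MyServiceClient(address, credentials, clientChannelOptions);
```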
@murgatroid99 Do you know the reasons why the server default values are so high? Or am I misunderstanding these values? I recently read about an idleTimeout to auto-close the stream. Possibly this could help us; how can we set it via JS?
If the server ends your stream, you should start a new one if you want to continue to have an active stream.
I'm not saying that you should do anything on the server. I'm explaining what the client behavior should be, which depends on what the server does. If the server continues sending messages, the client doesn't need to do anything special; it will just keep getting messages. Otherwise, if the server ends the stream with a status (which you will see in the `status` event), the client will need to create a new stream to continue getting messages.
If the server both stops sending messages and does not end the stream with a status, that is misbehavior on the part of the server that the client cannot handle.
The point of that is to ensure that all incoming messages have been processed before emitting the status. Lower layers use the status as a signal that all incoming messages have been received, so there should not be any code path that receives a status but never sets `read_status`.