Connecting to shut-down servers hangs the client for a random time before emitting error 14 UNAVAILABLE
Problem description
I have configured 4 clients to make unary calls to 4 servers. Each client is an instance of a class and is stored in a JS Map. Some calls must be made to all servers, and others to only one. Because a few servers are shut down, I send the same call through each client (each instance holds the ip:port of its server) and get error 14 UNAVAILABLE for the ones that are shut down. That's fine! But some clients randomly hang while waiting for the connection error. So every call to a shut-down server gives me error 14 after a random delay, from milliseconds up to 30 seconds! Why? I assumed the error would be emitted instantly. This causes problems for me, because I need to check all servers before continuing with the application code.
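The "check all servers before continuing" step described above can be sketched roughly as follows. This is only an illustration, not the reporter's code: `checkAllServers` and `probe` are hypothetical names, and `probe` stands for whatever function issues one unary call on one client and returns a Promise. Running the probes in parallel means a slow failure on one server does not delay the results from the others.

```javascript
// Hypothetical helper: `clients` is a Map of name -> client, and `probe`
// is any function that takes one client and returns a Promise.
async function checkAllServers(clients, probe) {
  const names = [...clients.keys()];
  // Start every probe at once; allSettled never rejects, so one
  // unavailable server cannot hide the results of the others.
  const settled = await Promise.allSettled(
    names.map((name) => Promise.resolve().then(() => probe(clients.get(name))))
  );
  const report = new Map();
  settled.forEach((outcome, i) => {
    report.set(
      names[i],
      outcome.status === 'fulfilled'
        ? { ok: true, value: outcome.value }
        : { ok: false, error: outcome.reason }
    );
  });
  return report;
}
```

The total wait is still bounded by the slowest probe, so this does not remove the delay described in this issue; it only keeps the delays from adding up.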
Reproduction steps
Create a simple gRPC client and try to make a unary call to any nonexistent server ip:port. You get error 14. Repeat the same call a few times, and you will see that the client sometimes hangs for several seconds.
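One way to make the reproduction measurable is to time how long each attempt takes to call back. The sketch below is an assumption-laden illustration: `timeCall` and `repeatTimed` are hypothetical helpers, and `invoke` is any function that accepts a Node-style `(err, response)` callback, so it is not tied to a particular client.

```javascript
// Hypothetical helper: measures how long `invoke` takes to call back.
// `invoke` receives a Node-style callback (err, response).
function timeCall(invoke) {
  return new Promise((resolve) => {
    const start = Date.now();
    invoke((err, response) => {
      resolve({ elapsedMs: Date.now() - start, err, response });
    });
  });
}

// Repeats the same call sequentially and collects one timing per attempt.
async function repeatTimed(invoke, times) {
  const results = [];
  for (let i = 0; i < times; i++) {
    results.push(await timeCall(invoke));
  }
  return results;
}
```

With a grpc-js client, `invoke` could be something like `(cb) => client.getAffiliation(request, cb)`, where `client` and `request` are assumed to exist in the caller's code. Printing `elapsedMs` and `err.code` for each attempt should show whether the error 14 arrives quickly or after a long wait.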
Environment
- OS name, version and architecture: Linux CentOS 7, 64-bit
- Node version: 14.18.0
- Node installation method: Binary
- Package name and version: @grpc/grpc-js@1.5.1 (also tested with 1.5.3)
Additional context
I simplified my setup to a single client and reproduced the same issue. I think this issue is related to others such as #1591, #1815, or https://stackoverflow.com/questions/61565913/why-does-my-node-js-grpc-client-take-3-seconds-to-send-a-request-to-my-python-gr
Of course, I set a deadline on the client to avoid hanging while waiting for error 14, but this is a workaround. A 1-second deadline is OK, but it is not a fix.
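The deadline workaround mentioned above might look roughly like this. It is a sketch under assumptions, not the reporter's actual code: `unaryWithDeadline` is a hypothetical wrapper, and it relies on the client method accepting `(request, options, callback)` with a `deadline` call option, which is how grpc-js unary methods are generally invoked.

```javascript
// Hypothetical wrapper: calls client[methodName] with a per-call deadline
// and turns the Node-style callback into a Promise.
function unaryWithDeadline(client, methodName, request, timeoutMs) {
  return new Promise((resolve, reject) => {
    // The deadline is an absolute point in time, not a duration.
    const options = { deadline: new Date(Date.now() + timeoutMs) };
    client[methodName](request, options, (err, response) => {
      if (err) {
        reject(err);
      } else {
        resolve(response);
      }
    });
  });
}
```

Note that when the deadline fires first, the call fails with status 4 (DEADLINE_EXCEEDED) rather than 14 (UNAVAILABLE), so code that checks for server availability has to treat both codes as "not reachable".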
This is the console output. As you can see, the client waits about 30 seconds before the connection attempt fails. I want it to fail instantly.
D 2022-01-25T23:35:05.884Z | channel | (1) dns:192.168.10.20:50051 createCall [11] method="/dvdriver.DriverService/GetAffiliation", deadline=Infinity
D 2022-01-25T23:35:05.885Z | call_stream | [11] Sending metadata
D 2022-01-25T23:35:05.885Z | channel | (1) dns:192.168.10.20:50051 Pick result: QUEUE subchannel: undefined status: undefined undefined
D 2022-01-25T23:35:05.885Z | channel | (1) dns:192.168.10.20:50051 callRefTimer.ref | configSelectionQueue.length=0 pickQueue.length=1
D 2022-01-25T23:35:05.885Z | call_stream | [11] write() called with message of length 27
D 2022-01-25T23:35:05.885Z | call_stream | [11] end() called
D 2022-01-25T23:35:05.885Z | resolving_load_balancer | dns:192.168.10.20:50051 IDLE -> CONNECTING
D 2022-01-25T23:35:05.885Z | channel | (1) dns:192.168.10.20:50051 callRefTimer.unref | configSelectionQueue.length=0 pickQueue.length=0
D 2022-01-25T23:35:05.885Z | channel | (1) dns:192.168.10.20:50051 Pick result: QUEUE subchannel: undefined status: undefined undefined
D 2022-01-25T23:35:05.885Z | channel | (1) dns:192.168.10.20:50051 callRefTimer.ref | configSelectionQueue.length=0 pickQueue.length=1
D 2022-01-25T23:35:05.885Z | connectivity_state | (1) dns:192.168.10.20:50051 IDLE -> CONNECTING
D 2022-01-25T23:35:05.885Z | resolving_load_balancer | dns:192.168.10.20:50051 CONNECTING -> CONNECTING
D 2022-01-25T23:35:05.885Z | channel | (1) dns:192.168.10.20:50051 callRefTimer.unref | configSelectionQueue.length=0 pickQueue.length=0
D 2022-01-25T23:35:05.885Z | channel | (1) dns:192.168.10.20:50051 Pick result: QUEUE subchannel: undefined status: undefined undefined
D 2022-01-25T23:35:05.885Z | channel | (1) dns:192.168.10.20:50051 callRefTimer.ref | configSelectionQueue.length=0 pickQueue.length=1
D 2022-01-25T23:35:05.886Z | connectivity_state | (1) dns:192.168.10.20:50051 CONNECTING -> CONNECTING
D 2022-01-25T23:35:05.886Z | call_stream | [11] deferring writing data chunk of length 32
<--------------- 3 SECONDS LATER ------------->
D 2022-01-25T23:35:08.852Z | subchannel_refcount | (16) 192.168.10.20:50051 refcount 1 -> 0
<--------------- 27 SECONDS LATER ------------->
D 2022-01-25T23:35:32.889Z | dns_resolver | Returning IP address for target dns:192.168.10.20:50051
D 2022-01-25T23:35:32.890Z | pick_first | Connect to address list 192.168.10.20:50051
D 2022-01-25T23:35:32.890Z | subchannel | (19) 192.168.10.20:50051 Subchannel constructed with options {}
D 2022-01-25T23:35:32.890Z | subchannel_refcount | (19) 192.168.10.20:50051 refcount 0 -> 1
D 2022-01-25T23:35:32.890Z | subchannel_refcount | (19) 192.168.10.20:50051 refcount 1 -> 2
D 2022-01-25T23:35:32.890Z | pick_first | Start connecting to subchannel with address 192.168.10.20:50051
D 2022-01-25T23:35:32.890Z | pick_first | IDLE -> CONNECTING
D 2022-01-25T23:35:32.890Z | resolving_load_balancer | dns:192.168.10.20:50051 CONNECTING -> CONNECTING
D 2022-01-25T23:35:32.890Z | channel | (1) dns:192.168.10.20:50051 callRefTimer.unref | configSelectionQueue.length=0 pickQueue.length=0
D 2022-01-25T23:35:32.890Z | channel | (1) dns:192.168.10.20:50051 Pick result: QUEUE subchannel: undefined status: undefined undefined
D 2022-01-25T23:35:32.890Z | channel | (1) dns:192.168.10.20:50051 callRefTimer.ref | configSelectionQueue.length=0 pickQueue.length=1
D 2022-01-25T23:35:32.890Z | connectivity_state | (1) dns:192.168.10.20:50051 CONNECTING -> CONNECTING
D 2022-01-25T23:35:32.891Z | subchannel | (19) 192.168.10.20:50051 IDLE -> CONNECTING
D 2022-01-25T23:35:32.891Z | pick_first | CONNECTING -> CONNECTING
D 2022-01-25T23:35:32.891Z | resolving_load_balancer | dns:192.168.10.20:50051 CONNECTING -> CONNECTING
D 2022-01-25T23:35:32.891Z | channel | (1) dns:192.168.10.20:50051 callRefTimer.unref | configSelectionQueue.length=0 pickQueue.length=0
D 2022-01-25T23:35:32.891Z | channel | (1) dns:192.168.10.20:50051 Pick result: QUEUE subchannel: undefined status: undefined undefined
D 2022-01-25T23:35:32.891Z | channel | (1) dns:192.168.10.20:50051 callRefTimer.ref | configSelectionQueue.length=0 pickQueue.length=1
D 2022-01-25T23:35:32.891Z | connectivity_state | (1) dns:192.168.10.20:50051 CONNECTING -> CONNECTING
D 2022-01-25T23:35:32.891Z | channel | (1) dns:192.168.10.20:50051 callRefTimer.unref | configSelectionQueue.length=0 pickQueue.length=1
D 2022-01-25T23:35:32.891Z | subchannel | (19) 192.168.10.20:50051 creating HTTP/2 session
D 2022-01-25T23:35:32.895Z | subchannel | (19) 192.168.10.20:50051 connection closed with error connect EHOSTUNREACH 192.168.10.20:50051
D 2022-01-25T23:35:32.895Z | subchannel | (19) 192.168.10.20:50051 connection closed
D 2022-01-25T23:35:32.895Z | subchannel | (19) 192.168.10.20:50051 CONNECTING -> TRANSIENT_FAILURE
D 2022-01-25T23:35:32.895Z | pick_first | CONNECTING -> TRANSIENT_FAILURE
D 2022-01-25T23:35:32.895Z | resolving_load_balancer | dns:192.168.10.20:50051 CONNECTING -> TRANSIENT_FAILURE
D 2022-01-25T23:35:32.895Z | channel | (1) dns:192.168.10.20:50051 Pick result: TRANSIENT_FAILURE subchannel: undefined status: 14 No connection established
D 2022-01-25T23:35:32.895Z | call_stream | [11] cancelWithStatus code: 14 details: "No connection established"
D 2022-01-25T23:35:32.895Z | call_stream | [11] ended with status: code=14 details="No connection established"
D 2022-01-25T23:35:32.895Z | connectivity_state | (1) dns:192.168.10.20:50051 CONNECTING -> TRANSIENT_FAILURE
<-------- FINALLY GET ERROR 14!!!! --------------->
For reference only, this is the console output when I instantiate the client class:
D 2022-01-26T03:31:29.762Z | index | Loading @grpc/grpc-js version 1.5.1
D 2022-01-26T03:31:29.976Z | resolving_load_balancer | dns:192.168.10.10:50050 IDLE -> IDLE
D 2022-01-26T03:31:29.976Z | connectivity_state | (1) dns:192.168.10.10:50050 IDLE -> IDLE
D 2022-01-26T03:31:29.977Z | dns_resolver | Resolver constructed for target dns:192.168.10.10:50050
D 2022-01-26T03:31:29.978Z | channel | (1) dns:192.168.10.10:50050 Channel constructed with options {}
Thanks for the help and this module!
Regards, Normando
Issue Analytics
- Created 2 years ago
- Comments: 9 (4 by maintainers)
Top GitHub Comments
I was not able to reproduce it, but fortunately I think I found the bug anyway. I published @grpc/grpc-js version 1.5.4 with a change that I think fixes this bug. Can you try it out?
Well, I can confirm that 1.5.4 fixes this issue! I tested many times with both 1.5.3 and 1.5.4. It also seems that 1.5.3 fails only when I connect to an IP address, not an FQDN, but I am not 100% sure about that. In any case, 1.5.4 definitely fixes this issue, and the error 14 response also arrives faster.
Thanks!