grpc-js: requests hang indefinitely when executed while the `ResolvingLoadBalancer` class is in the `TRANSIENT_FAILURE` state
Problem description
gRPC requests executed while the `ResolvingLoadBalancer` class is in the `TRANSIENT_FAILURE` state hang indefinitely, even after the backoff timer has finished and reset the resolver back to the `IDLE` state.
This can occur when the gRPC client’s DNS resolution fails but the client continues to send requests to the service.
Reproduction steps
- Clone this repo: https://github.com/chrskrchr/grpc-js-dns-hang
- Run `npm install`
- Run `npm run start`
- The script does the following:
  - Creates a gRPC client with a dummy service definition and a bogus address
  - Executes a request to the bogus address that fails immediately with the expected error: `Error: 14 UNAVAILABLE: Name resolution failed for target dns:bogus.host`
  - Sleeps for `1000ms`, which is notably less than the client’s configured `2500ms` backoff setting on L38: `"grpc.initial_reconnect_backoff_ms": 2500`
  - Executes a second request to the bogus address

This second request hangs indefinitely, even after the reconnect backoff expires and the resolver has been set back to the `IDLE` state.
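Because the second request never settles, a script that simply awaits it gives no signal that anything is wrong. Below is a minimal sketch of a watchdog that turns the hang into a visible error, assuming the request is wrapped in a promise. `withTimeout` is a hypothetical helper, not part of grpc-js or the reproduction repo, and the never-settling promise only stands in for the hanging call:

```javascript
// Hypothetical helper: rejects if `promise` has not settled within `ms`.
function withTimeout(promise, ms, label = "request") {
  let timer;
  const watchdog = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} still pending after ${ms}ms`)),
      ms
    );
  });
  // Whichever settles first wins; always clear the timer so Node can exit.
  return Promise.race([promise, watchdog]).finally(() => clearTimeout(timer));
}

// Stand-in for the hanging second request: a promise that never settles.
const hangingRequest = new Promise(() => {});

withTimeout(hangingRequest, 50, "request #2").catch((err) => {
  console.log(err.message); // request #2 still pending after 50ms
});
```

The watchdog only reports the hang; it does not cancel the underlying call.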
Environment
- macOS Monterey (12.3)
- Node v14.18.1
- Node installation method: nvm
- grpc-js@1.6.3
Additional context
Output from the script when the second request is executed while the resolver is in the `TRANSIENT_FAILURE` state, causing the second request to hang indefinitely:
➜ grpc-js-dns-hang git:(master) ✗ npm run start
> grpc-js-dns-hang@1.0.0 start /Users/chris.karcher/src/care/grpc-js-dns-hang
> GRPC_VERBOSITY=DEBUG GRPC_TRACE=all node index.js
D 2022-04-12T19:58:19.914Z | index | Loading @grpc/grpc-js version 1.6.3
D 2022-04-12T19:58:19.992Z | resolving_load_balancer | dns:bogus.host IDLE -> IDLE
D 2022-04-12T19:58:19.992Z | connectivity_state | (1) dns:bogus.host IDLE -> IDLE
D 2022-04-12T19:58:19.992Z | dns_resolver | Resolver constructed for target dns:bogus.host
D 2022-04-12T19:58:19.994Z | channel | (1) dns:bogus.host Channel constructed with options {
"grpc.initial_reconnect_backoff_ms": 2500
}
D 2022-04-12T19:58:19.994Z | channel_stacktrace | (1) Channel constructed
at new ChannelImplementation (/Users/chris.karcher/src/care/grpc-js-dns-hang/node_modules/@grpc/grpc-js/build/src/channel.js:189:23)
at new Client (/Users/chris.karcher/src/care/grpc-js-dns-hang/node_modules/@grpc/grpc-js/build/src/client.js:62:36)
at new ServiceClientImpl (/Users/chris.karcher/src/care/grpc-js-dns-hang/node_modules/@grpc/grpc-js/build/src/make-client.js:58:5)
at Object.<anonymous> (/Users/chris.karcher/src/care/grpc-js-dns-hang/index.js:37:16)
at Module._compile (internal/modules/cjs/loader.js:1085:14)
at Object.Module._extensions..js (internal/modules/cjs/loader.js:1114:10)
at Module.load (internal/modules/cjs/loader.js:950:32)
at Function.Module._load (internal/modules/cjs/loader.js:790:12)
at Function.executeUserEntryPoint [as runMain] (internal/modules/run_main.js:76:12)
at internal/main/run_main_module.js:17:47
executing request #1
D 2022-04-12T19:58:19.996Z | channel | (1) dns:bogus.host createCall [0] method="/PingAPI/Ping", deadline=Infinity
D 2022-04-12T19:58:19.997Z | call_stream | [0] Sending metadata
D 2022-04-12T19:58:19.997Z | dns_resolver | Looking up DNS hostname bogus.host
D 2022-04-12T19:58:19.999Z | resolving_load_balancer | dns:bogus.host IDLE -> CONNECTING
D 2022-04-12T19:58:20.000Z | connectivity_state | (1) dns:bogus.host IDLE -> CONNECTING
D 2022-04-12T19:58:20.000Z | resolving_load_balancer | dns:bogus.host CONNECTING -> CONNECTING
D 2022-04-12T19:58:20.000Z | connectivity_state | (1) dns:bogus.host CONNECTING -> CONNECTING
D 2022-04-12T19:58:20.000Z | channel | (1) dns:bogus.host callRefTimer.ref | configSelectionQueue.length=1 pickQueue.length=0
D 2022-04-12T19:58:20.001Z | call_stream | [0] write() called with message of length 0
D 2022-04-12T19:58:20.001Z | call_stream | [0] end() called
D 2022-04-12T19:58:20.002Z | call_stream | [0] deferring writing data chunk of length 5
D 2022-04-12T19:58:20.066Z | dns_resolver | Resolution error for target dns:bogus.host: getaddrinfo ENOTFOUND bogus.host
D 2022-04-12T19:58:20.067Z | resolving_load_balancer | dns:bogus.host CONNECTING -> TRANSIENT_FAILURE
D 2022-04-12T19:58:20.067Z | channel | (1) dns:bogus.host callRefTimer.unref | configSelectionQueue.length=1 pickQueue.length=0
D 2022-04-12T19:58:20.067Z | connectivity_state | (1) dns:bogus.host CONNECTING -> TRANSIENT_FAILURE
D 2022-04-12T19:58:20.067Z | channel | (1) dns:bogus.host Name resolution failed with calls queued for config selection
D 2022-04-12T19:58:20.067Z | call_stream | [0] cancelWithStatus code: 14 details: "Name resolution failed for target dns:bogus.host"
D 2022-04-12T19:58:20.067Z | call_stream | [0] ended with status: code=14 details="Name resolution failed for target dns:bogus.host"
Error: 14 UNAVAILABLE: Name resolution failed for target dns:bogus.host
at Object.callErrorFromStatus (/Users/chris.karcher/src/care/grpc-js-dns-hang/node_modules/@grpc/grpc-js/build/src/call.js:31:26)
at Object.onReceiveStatus (/Users/chris.karcher/src/care/grpc-js-dns-hang/node_modules/@grpc/grpc-js/build/src/client.js:180:52)
at Object.onReceiveStatus (/Users/chris.karcher/src/care/grpc-js-dns-hang/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:365:141)
at Object.onReceiveStatus (/Users/chris.karcher/src/care/grpc-js-dns-hang/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:328:181)
at /Users/chris.karcher/src/care/grpc-js-dns-hang/node_modules/@grpc/grpc-js/build/src/call-stream.js:187:78
at processTicksAndRejections (internal/process/task_queues.js:77:11) {
code: 14,
details: 'Name resolution failed for target dns:bogus.host',
metadata: Metadata { internalRepr: Map(0) {}, options: {} }
}
request #1 finished
sleeping...
executing request #2
D 2022-04-12T19:58:21.071Z | channel | (1) dns:bogus.host createCall [1] method="/PingAPI/Ping", deadline=Infinity
D 2022-04-12T19:58:21.071Z | call_stream | [1] Sending metadata
D 2022-04-12T19:58:21.072Z | channel | (1) dns:bogus.host callRefTimer.ref | configSelectionQueue.length=1 pickQueue.length=0
D 2022-04-12T19:58:21.072Z | call_stream | [1] write() called with message of length 0
D 2022-04-12T19:58:21.072Z | call_stream | [1] end() called
D 2022-04-12T19:58:21.072Z | call_stream | [1] deferring writing data chunk of length 5
D 2022-04-12T19:58:22.568Z | resolving_load_balancer | dns:bogus.host TRANSIENT_FAILURE -> IDLE
D 2022-04-12T19:58:22.568Z | channel | (1) dns:bogus.host callRefTimer.unref | configSelectionQueue.length=1 pickQueue.length=0
D 2022-04-12T19:58:22.568Z | connectivity_state | (1) dns:bogus.host TRANSIENT_FAILURE -> IDLE
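The timestamps in the trace above can be checked against the configured backoff. The following sketch only does arithmetic on values copied from the log; the variable names are illustrative:

```javascript
// Timestamps copied from the trace above.
const enteredTransientFailure = Date.parse("2022-04-12T19:58:20.067Z");
const secondCallCreated = Date.parse("2022-04-12T19:58:21.071Z");
const returnedToIdle = Date.parse("2022-04-12T19:58:22.568Z");

// "grpc.initial_reconnect_backoff_ms" from the channel options in the trace.
const backoffMs = 2500;

const callOffsetMs = secondCallCreated - enteredTransientFailure;
const observedBackoffMs = returnedToIdle - enteredTransientFailure;

console.log(`request #2 created ${callOffsetMs}ms after the failure`); // 1004ms
console.log(`resolver returned to IDLE after ${observedBackoffMs}ms`); // 2501ms
console.log(`request #2 inside backoff window: ${callOffsetMs < backoffMs}`); // true
```

The second call is created about one second into the backoff window, and the resolver returns to `IDLE` about 2.5 seconds after the failure, yet the trace shows no further `Looking up DNS hostname` line for call `[1]`.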
If the sleep duration on L52 is increased to something higher than the client’s backoff setting (e.g., increased to `5000ms`), the resolver is allowed to return to the `IDLE` state and the second request fails immediately as expected, just like the first request:
➜ grpc-js-dns-hang git:(master) ✗ npm run start
> grpc-js-dns-hang@1.0.0 start /Users/chris.karcher/src/care/grpc-js-dns-hang
> GRPC_VERBOSITY=DEBUG GRPC_TRACE=all node index.js
D 2022-04-12T19:58:53.273Z | index | Loading @grpc/grpc-js version 1.6.3
D 2022-04-12T19:58:53.316Z | resolving_load_balancer | dns:bogus.host IDLE -> IDLE
D 2022-04-12T19:58:53.317Z | connectivity_state | (1) dns:bogus.host IDLE -> IDLE
D 2022-04-12T19:58:53.317Z | dns_resolver | Resolver constructed for target dns:bogus.host
D 2022-04-12T19:58:53.318Z | channel | (1) dns:bogus.host Channel constructed with options {
"grpc.initial_reconnect_backoff_ms": 2500
}
D 2022-04-12T19:58:53.318Z | channel_stacktrace | (1) Channel constructed
at new ChannelImplementation (/Users/chris.karcher/src/care/grpc-js-dns-hang/node_modules/@grpc/grpc-js/build/src/channel.js:189:23)
at new Client (/Users/chris.karcher/src/care/grpc-js-dns-hang/node_modules/@grpc/grpc-js/build/src/client.js:62:36)
at new ServiceClientImpl (/Users/chris.karcher/src/care/grpc-js-dns-hang/node_modules/@grpc/grpc-js/build/src/make-client.js:58:5)
at Object.<anonymous> (/Users/chris.karcher/src/care/grpc-js-dns-hang/index.js:37:16)
at Module._compile (internal/modules/cjs/loader.js:1085:14)
at Object.Module._extensions..js (internal/modules/cjs/loader.js:1114:10)
at Module.load (internal/modules/cjs/loader.js:950:32)
at Function.Module._load (internal/modules/cjs/loader.js:790:12)
at Function.executeUserEntryPoint [as runMain] (internal/modules/run_main.js:76:12)
at internal/main/run_main_module.js:17:47
executing request #1
D 2022-04-12T19:58:53.320Z | channel | (1) dns:bogus.host createCall [0] method="/PingAPI/Ping", deadline=Infinity
D 2022-04-12T19:58:53.321Z | call_stream | [0] Sending metadata
D 2022-04-12T19:58:53.322Z | dns_resolver | Looking up DNS hostname bogus.host
D 2022-04-12T19:58:53.323Z | resolving_load_balancer | dns:bogus.host IDLE -> CONNECTING
D 2022-04-12T19:58:53.323Z | connectivity_state | (1) dns:bogus.host IDLE -> CONNECTING
D 2022-04-12T19:58:53.323Z | resolving_load_balancer | dns:bogus.host CONNECTING -> CONNECTING
D 2022-04-12T19:58:53.323Z | connectivity_state | (1) dns:bogus.host CONNECTING -> CONNECTING
D 2022-04-12T19:58:53.324Z | channel | (1) dns:bogus.host callRefTimer.ref | configSelectionQueue.length=1 pickQueue.length=0
D 2022-04-12T19:58:53.325Z | call_stream | [0] write() called with message of length 0
D 2022-04-12T19:58:53.326Z | call_stream | [0] end() called
D 2022-04-12T19:58:53.327Z | call_stream | [0] deferring writing data chunk of length 5
D 2022-04-12T19:58:53.328Z | dns_resolver | Resolution error for target dns:bogus.host: getaddrinfo ENOTFOUND bogus.host
D 2022-04-12T19:58:53.328Z | resolving_load_balancer | dns:bogus.host CONNECTING -> TRANSIENT_FAILURE
D 2022-04-12T19:58:53.328Z | channel | (1) dns:bogus.host callRefTimer.unref | configSelectionQueue.length=1 pickQueue.length=0
D 2022-04-12T19:58:53.328Z | connectivity_state | (1) dns:bogus.host CONNECTING -> TRANSIENT_FAILURE
D 2022-04-12T19:58:53.328Z | channel | (1) dns:bogus.host Name resolution failed with calls queued for config selection
D 2022-04-12T19:58:53.328Z | call_stream | [0] cancelWithStatus code: 14 details: "Name resolution failed for target dns:bogus.host"
D 2022-04-12T19:58:53.328Z | call_stream | [0] ended with status: code=14 details="Name resolution failed for target dns:bogus.host"
Error: 14 UNAVAILABLE: Name resolution failed for target dns:bogus.host
at Object.callErrorFromStatus (/Users/chris.karcher/src/care/grpc-js-dns-hang/node_modules/@grpc/grpc-js/build/src/call.js:31:26)
at Object.onReceiveStatus (/Users/chris.karcher/src/care/grpc-js-dns-hang/node_modules/@grpc/grpc-js/build/src/client.js:180:52)
at Object.onReceiveStatus (/Users/chris.karcher/src/care/grpc-js-dns-hang/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:365:141)
at Object.onReceiveStatus (/Users/chris.karcher/src/care/grpc-js-dns-hang/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:328:181)
at /Users/chris.karcher/src/care/grpc-js-dns-hang/node_modules/@grpc/grpc-js/build/src/call-stream.js:187:78
at processTicksAndRejections (internal/process/task_queues.js:77:11) {
code: 14,
details: 'Name resolution failed for target dns:bogus.host',
metadata: Metadata { internalRepr: Map(0) {}, options: {} }
}
request #1 finished
sleeping...
D 2022-04-12T19:58:55.829Z | resolving_load_balancer | dns:bogus.host TRANSIENT_FAILURE -> IDLE
D 2022-04-12T19:58:55.830Z | connectivity_state | (1) dns:bogus.host TRANSIENT_FAILURE -> IDLE
executing request #2
D 2022-04-12T19:58:58.332Z | channel | (1) dns:bogus.host createCall [1] method="/PingAPI/Ping", deadline=Infinity
D 2022-04-12T19:58:58.332Z | call_stream | [1] Sending metadata
D 2022-04-12T19:58:58.332Z | dns_resolver | Looking up DNS hostname bogus.host
D 2022-04-12T19:58:58.333Z | resolving_load_balancer | dns:bogus.host IDLE -> CONNECTING
D 2022-04-12T19:58:58.333Z | connectivity_state | (1) dns:bogus.host IDLE -> CONNECTING
D 2022-04-12T19:58:58.333Z | resolving_load_balancer | dns:bogus.host CONNECTING -> CONNECTING
D 2022-04-12T19:58:58.333Z | connectivity_state | (1) dns:bogus.host CONNECTING -> CONNECTING
D 2022-04-12T19:58:58.333Z | channel | (1) dns:bogus.host callRefTimer.ref | configSelectionQueue.length=1 pickQueue.length=0
D 2022-04-12T19:58:58.333Z | call_stream | [1] write() called with message of length 0
D 2022-04-12T19:58:58.333Z | call_stream | [1] end() called
D 2022-04-12T19:58:58.333Z | call_stream | [1] deferring writing data chunk of length 5
D 2022-04-12T19:58:58.334Z | dns_resolver | Resolution error for target dns:bogus.host: getaddrinfo ENOTFOUND bogus.host
D 2022-04-12T19:58:58.334Z | resolving_load_balancer | dns:bogus.host CONNECTING -> TRANSIENT_FAILURE
D 2022-04-12T19:58:58.334Z | channel | (1) dns:bogus.host callRefTimer.unref | configSelectionQueue.length=1 pickQueue.length=0
D 2022-04-12T19:58:58.334Z | connectivity_state | (1) dns:bogus.host CONNECTING -> TRANSIENT_FAILURE
D 2022-04-12T19:58:58.334Z | channel | (1) dns:bogus.host Name resolution failed with calls queued for config selection
D 2022-04-12T19:58:58.334Z | call_stream | [1] cancelWithStatus code: 14 details: "Name resolution failed for target dns:bogus.host"
D 2022-04-12T19:58:58.334Z | call_stream | [1] ended with status: code=14 details="Name resolution failed for target dns:bogus.host"
Error: 14 UNAVAILABLE: Name resolution failed for target dns:bogus.host
at Object.callErrorFromStatus (/Users/chris.karcher/src/care/grpc-js-dns-hang/node_modules/@grpc/grpc-js/build/src/call.js:31:26)
at Object.onReceiveStatus (/Users/chris.karcher/src/care/grpc-js-dns-hang/node_modules/@grpc/grpc-js/build/src/client.js:180:52)
at Object.onReceiveStatus (/Users/chris.karcher/src/care/grpc-js-dns-hang/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:365:141)
at Object.onReceiveStatus (/Users/chris.karcher/src/care/grpc-js-dns-hang/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:328:181)
at /Users/chris.karcher/src/care/grpc-js-dns-hang/node_modules/@grpc/grpc-js/build/src/call-stream.js:187:78
at processTicksAndRejections (internal/process/task_queues.js:77:11) {
code: 14,
details: 'Name resolution failed for target dns:bogus.host',
metadata: Metadata { internalRepr: Map(0) {}, options: {} }
}
finished
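Both traces show the calls being created with `deadline=Infinity`, so a call that gets stuck has nothing that would ever end it. As a possible mitigation on affected versions (an assumption on my part, not something confirmed in this issue and not verified against 1.6.3), a per-call deadline should let the call end instead of waiting forever. grpc-js call options accept a `deadline` given as a `Date` or a number; `callOptionsWithDeadline` below is a hypothetical helper name:

```javascript
// Hypothetical helper: builds call options with an absolute deadline
// `ms` milliseconds from `now`.
function callOptionsWithDeadline(ms, now = Date.now()) {
  if (!Number.isFinite(ms) || ms <= 0) {
    throw new RangeError("deadline offset must be a positive, finite number of milliseconds");
  }
  return { deadline: new Date(now + ms) };
}

const options = callOptionsWithDeadline(5000);
console.log(options.deadline.toISOString());

// Usage with a grpc-js client. The method name `ping` is an assumption
// based on the "/PingAPI/Ping" path in the trace:
//   client.ping({}, callOptionsWithDeadline(5000), (err, res) => { /* ... */ });
```

This bounds how long a caller waits; it does not address the underlying queueing behaviour described above.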
Top GitHub Comments
Thanks a lot. This issue really helped me solve my problems.
I published that change in version 1.6.4. Please try it out.