DNS storm with round robin load balancing in grpc-js
Problem description
We are having some problems with the load balancing in grpc-js.
We are seeing uneven distribution of calls to our service pods, which sometimes ends up overloading some of them while keeping others at very low load.
We think this might be because of the default load balancing strategy of "pick first", so we tried enabling round robin, but this caused a bunch of issues:
- The distribution, albeit more consistent, was still not uniform
- The pods where the client was running started using 3 times the CPU and generated a flurry of requests to our DNS servers
Any ideas how we could address this uneven distribution issue, and what could be wrong with load balancing?
Reproduction steps
Our (singleton) clients get instantiated with the DNS address of the service. The DNS returns the IP of all the available pods for the given service. We enable round robin load balancing by providing this configuration to the client:
```js
'grpc.service_config': JSON.stringify({
  loadBalancingConfig: [{ round_robin: {} }],
})
```
There was no other change to the clients besides the lb config.
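For reference, a minimal sketch of how that option would be wired into a client; the target address and the use of insecure credentials are assumptions for illustration:

```javascript
// Sketch: building the channel options that enable round-robin in grpc-js.
// The service config must be passed as a JSON string under 'grpc.service_config'.
const serviceConfig = JSON.stringify({
  loadBalancingConfig: [{ round_robin: {} }],
});

const channelOptions = { 'grpc.service_config': serviceConfig };

// These options would then be passed to a generated client, e.g.:
//   const client = new MyServiceClient(            // hypothetical client class
//     'dns:///my-service.default.svc:50051',       // hypothetical DNS address
//     grpc.credentials.createInsecure(),
//     channelOptions
//   );
console.log(channelOptions['grpc.service_config']);
```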
Environment
- OS name, version and architecture: Debian GNU/Linux 10 (buster) x86
- Node version: 14.17.6
- Node installation method: yarn
- Package name and version: 1.4.5
Additional context
When we tried to deploy the mentioned config change, this is the behavior we saw (the baseline is for ~100 pods, while the spike is for just 4 canary pods where a single client configuration was changed). The charts showed spikes in both CPU usage and DNS requests.

[CPU usage chart]

[DNS requests chart]
Issue Analytics
- Created: 2 years ago
- Reactions: 2
- Comments: 13 (7 by maintainers)
Top GitHub Comments
I have published grpc-js 1.5.1 with some throttling on DNS requests. Can you try that out and see what impact it has?
I think I see what is happening here: the clients are failing to connect to some of the addresses returned by the DNS. Those connection failures trigger DNS re-resolution attempts, which do not back off in this situation. The lack of a backoff here is a bug that I will fix. The connection failures would also explain the uneven request distribution.
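The missing backoff described above is the kind of throttling the fix adds; gRPC's documented connection backoff uses an initial delay of about 1s, a multiplier of 1.6, and a cap of 120s. A minimal sketch of that exponential backoff, with illustrative names (not the actual grpc-js implementation):

```javascript
// Sketch: exponential backoff between failed DNS re-resolution attempts.
// Returns a function that yields the next delay, growing by `multiplier`
// on each call until it hits `maxMs`.
function makeBackoff(baseMs = 1000, maxMs = 120000, multiplier = 1.6) {
  let nextMs = baseMs;
  return function nextDelay() {
    const delay = nextMs;
    nextMs = Math.min(nextMs * multiplier, maxMs);
    return delay;
  };
}

// Each successive failure waits longer before re-resolving.
const nextDelay = makeBackoff();
console.log(nextDelay()); // 1000
console.log(nextDelay()); // 1600
console.log(nextDelay()); // 2560
```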
You can get logs with more information about what is happening here by setting the environment variables `GRPC_TRACE=channel,round_robin,subchannel,dns_resolver` and `GRPC_VERBOSITY=DEBUG`.
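For example, the variables can be set when starting the client process; the entry point name here is a placeholder:

```shell
# Enable channel/resolver tracing for a grpc-js client process.
# "client.js" is a hypothetical entry point; substitute your own script.
GRPC_TRACE=channel,round_robin,subchannel,dns_resolver \
GRPC_VERBOSITY=DEBUG \
node client.js
```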