DNS storm with round robin load balancing in grpc-js
Problem description
We are having some problems with the load balancing in grpc-js.
We are seeing uneven distribution of calls to our service pods, which sometimes ends up overloading some of them while keeping others at very low load.
We think this might be because of the default load balancing strategy of "pick first", so we tried enabling round robin, but this caused a bunch of issues:
- The distribution, albeit more consistent, was still not uniform
- The pods where the client was running started using 3 times the CPU and generated a flurry of requests to our DNS servers
Any ideas how we could address this uneven distribution issue, and what could be wrong with load balancing?
Reproduction steps
Our (singleton) clients get instantiated with the DNS address of the service. The DNS returns the IP of all the available pods for the given service. We enable round robin load balancing by providing this configuration to the client:
```js
'grpc.service_config': JSON.stringify({
  loadBalancingConfig: [{ round_robin: {} }],
})
```
There was no other change to the clients besides the lb config.
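For reference, a minimal sketch of how that option would be wired into a client; the target address and the use of insecure credentials are assumptions for illustration:

```javascript
// Sketch: building the channel options that enable round-robin in grpc-js.
// The service config must be passed as a JSON string under 'grpc.service_config'.
const serviceConfig = JSON.stringify({
  loadBalancingConfig: [{ round_robin: {} }],
});

const channelOptions = { 'grpc.service_config': serviceConfig };

// These options would then be passed to a generated client, e.g.:
//   const client = new MyServiceClient(            // hypothetical client class
//     'dns:///my-service.default.svc:50051',       // hypothetical DNS address
//     grpc.credentials.createInsecure(),
//     channelOptions
//   );
console.log(channelOptions['grpc.service_config']);
```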
Environment
- OS name, version and architecture: Debian GNU/Linux 10 (buster) x86
- Node version: 14.17.6
- Node installation method: yarn
- Package name and version: 1.4.5
Additional context
When we tried to deploy the mentioned config change, this is the behavior we saw (the baseline is for ~100 pods, while the spike is for just 4 canary pods where a single client configuration was changed). The charts showed spikes in both CPU usage and DNS requests.

[CPU usage chart]

[DNS requests chart]
Issue Analytics
- Created: 2 years ago
- Reactions: 2
- Comments: 13 (7 by maintainers)
Top GitHub Comments
I have published grpc-js 1.5.1 with some throttling on DNS requests. Can you try that out and see what impact it has?
I think I see what is happening here: the clients are failing to connect to some of the addresses returned by the DNS. Those connection failures trigger DNS re-resolution attempts, which do not back off in this situation. The lack of a backoff here is a bug that I will fix. The connection failures would also explain the uneven request distribution.
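The missing backoff described above is the kind of throttling the fix adds; gRPC's documented connection backoff uses an initial delay of about 1s, a multiplier of 1.6, and a cap of 120s. A minimal sketch of that exponential backoff, with illustrative names (not the actual grpc-js implementation):

```javascript
// Sketch: exponential backoff between failed DNS re-resolution attempts.
// Returns a function that yields the next delay, growing by `multiplier`
// on each call until it hits `maxMs`.
function makeBackoff(baseMs = 1000, maxMs = 120000, multiplier = 1.6) {
  let nextMs = baseMs;
  return function nextDelay() {
    const delay = nextMs;
    nextMs = Math.min(nextMs * multiplier, maxMs);
    return delay;
  };
}

// Each successive failure waits longer before re-resolving.
const nextDelay = makeBackoff();
console.log(nextDelay()); // 1000
console.log(nextDelay()); // 1600
console.log(nextDelay()); // 2560
```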
You can get logs with more information about what is happening here by setting the environment variables `GRPC_TRACE=channel,round_robin,subchannel,dns_resolver` and `GRPC_VERBOSITY=DEBUG`.
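For example, the variables can be set when starting the client process; the entry point name here is a placeholder:

```shell
# Enable channel/resolver tracing for a grpc-js client process.
# "client.js" is a hypothetical entry point; substitute your own script.
GRPC_TRACE=channel,round_robin,subchannel,dns_resolver \
GRPC_VERBOSITY=DEBUG \
node client.js
```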