Potential memory leak in resolver-dns
Problem Description
Previously, we had an issue where upgrading @grpc/grpc-js from 1.3.x to 1.5.x introduced a channelz memory leak (fixed in that issue for 1.5.10).
Upgrading to 1.5.10 locally seems fine and I have noticed no issues. However, when we upgraded our staging/production environments, the memory leak appears to come back, with the only change being the update of @grpc/grpc-js from 1.3.x to 1.5.10.
Using Datadog’s continuous profiler, I wasn’t sure if this was the root issue, but there is definitely a growing heap.
Again, we are running a production service with a single grpc-js server that creates multiple grpc-js clients. The clients are created and destroyed using lightning-pool.
Channelz is disabled when we initialize the server and the clients with 'grpc.enable_channelz': 0.
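For reference, a minimal sketch of how we pass that option (the target address here is a placeholder, not our actual setup):

```ts
import * as grpc from '@grpc/grpc-js';

// Channel/server options with channelz turned off.
const options: grpc.ChannelOptions = { 'grpc.enable_channelz': 0 };

// Server side: the option is passed to the Server constructor.
const server = new grpc.Server(options);

// Client side: the same option is passed as the third constructor argument.
const client = new grpc.Client(
  'my-backend.internal:50051', // placeholder target
  grpc.credentials.createInsecure(),
  options
);
```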
Reproduction Steps
The reproduction steps are still the same as before, except that this time the service is under staging/production load.
Create a single grpc-js server that calls grpc-js clients as needed from a pool resource with channelz disabled. In our case, the server is running and when requests are made, we acquire a client via the pool (factory created once as a singleton) to make a request. These should be able to handle concurrent/multiple requests.
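Roughly, the client pooling looks like the sketch below. The names, the target address, and the exact lightning-pool API (factory shape, promise vs. callback acquire) are illustrative assumptions, not our real code:

```ts
import * as grpc from '@grpc/grpc-js';
import { createPool } from 'lightning-pool';

const TARGET = 'my-backend.internal:50051'; // placeholder address
const options: grpc.ChannelOptions = { 'grpc.enable_channelz': 0 };

// The factory is created once, as a singleton, and hands out grpc-js clients.
const factory = {
  create: () => new grpc.Client(TARGET, grpc.credentials.createInsecure(), options),
  destroy: (client: grpc.Client) => client.close(),
};

const pool = createPool(factory, { max: 10 });

// Per incoming request: acquire a client, use it, then release it back.
async function handleRequest(): Promise<void> {
  const client = await pool.acquire();
  try {
    // ... make the outbound gRPC call with `client` ...
  } finally {
    pool.release(client);
  }
}
```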
Environment
- OS Name: macOS (locally testing) and running on AWS EKS clusters (production)
- Node Version: 14.16.0
- Package Name and Version: @grpc/grpc-js@1.5.10
Additional Context
Checking out the profiler with Heap Live Size, it looks like there is a growing heap size for backoff-timeout.js, resolver-dns.js, load-balancer-child-handler.js, load-balancer-round-robin.js, and channel.ts. I let it run for about 2.5 hours and I am comparing the heap profiles from the first 30 minutes and the last 30 minutes to see what has changed.
When comparing with @grpc/grpc-js@1.3.x, these files don't appear to be in use in the heap profile at all.
I see that 1.6.x made some updates to timers and was wondering if that could be related?
Happy to provide more context or help as needed.
NOTE: To clarify the graph, the problem starts and ends within the highlighted intervals. Everything else is from a different process and from rolling the package back.
(Detail view of the other red section from above)
Issue Analytics
- Created a year ago
- Comments: 21 (12 by maintainers)
Top GitHub Comments
The requested tests have been added in #2105.
@sam-la-compass Can you check if the latest version of grpc-js fixes the original bug for you?
In the third image, the tooltip for “addTrace (channelz.js)” seems to be covering up the information about the top three contributors to the heap size. Can you say what those top three items are or share another screenshot that shows them? The top one in particular seems to be a very large fraction of the heap.
I think I can partially explain the failed DNS requests: those addresses look like they are supposed to be an IPv6 address plus a port, but the syntax is wrong. An IPv6 address needs to be enclosed in square brackets ([]) to use a port with it. For example, the proper syntax to represent the address in the top log is [2a00:1450:400f:801::200a]:443. You can fix that if you know what the source of those addresses is, but I am not sure why gRPC would not still treat it as an IPv6 address anyway.
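For illustration, a quick sketch of the difference (the client construction below is just to show where the target string goes; it is not taken from the reporter's code):

```ts
import * as grpc from '@grpc/grpc-js';

// Wrong: without brackets, the trailing ":443" cannot be distinguished
// from the rest of the IPv6 address, so "address + port" parsing fails.
const badTarget = '2a00:1450:400f:801::200a:443';

// Right: brackets around the IPv6 address, then the port.
const goodTarget = '[2a00:1450:400f:801::200a]:443';

const client = new grpc.Client(goodTarget, grpc.credentials.createSsl());
```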