Off-heap memory grows without bound using streaming API
Hi all, my grpc-java server's off-heap memory grows without bound.
gRPC version
- server: grpc-java 1.34.1
- client: grpc-java 1.34.1, gRPC-Swift 1.0.0
Other information
- Server-side streaming RPC for chat service.
- Connection over SSL; SSL offload is handled on the load balancer.
- Load is not so high
- Under 100 RPCs/sec per server.
- Under 10 RPCs/sec per client.
- I didn’t apply flow-control.
- Total client count is limited; it doesn't vary much over time.
- Server keepalive configurations (a builder-style configuration sketch follows this list):
  - keepalive-time: 30s
  - keepalive-timeout: 5s
  - max-connection-idle: 60s
- Every night at 3:00 I call `onCompleted()` on all the `StreamObserver`s on the server; the chat service is not in use at that moment (see the sweep sketch after this list).
- Clients randomly get `INTERNAL: RST_STREAM closed stream. HTTP/2 error code: INTERNAL_ERROR` from the LB, so I guess some broken connections are left on the server side.
  - I know this load balancer issue is unusual; it's under inspection.
- Mobile clients abruptly disconnect from the server for various reasons and then reconnect; I guess there are some broken connections left on the server side here, too.
- There may also be odd connection misuse on clients, e.g. unexpected multiple connections with open streams.
- I get `io.grpc.netty.shaded.io.netty.util.internal.OutOfDirectMemoryError` once memory reaches the limit. Memory is configured with `-Xmx1536m -XX:MaxDirectMemorySize=3584m`.
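For reference, those keepalive settings map roughly onto grpc-java's `NettyServerBuilder` as in the sketch below (assumptions: the port and the commented-out service registration are placeholders, not the actual production setup):

```java
import java.util.concurrent.TimeUnit;

import io.grpc.Server;
import io.grpc.netty.shaded.io.grpc.netty.NettyServerBuilder;

public class ChatServer {
    public static void main(String[] args) throws Exception {
        Server server = NettyServerBuilder.forPort(31105) // placeholder port
                // Same values as the keepalive settings listed above.
                .keepAliveTime(30, TimeUnit.SECONDS)
                .keepAliveTimeout(5, TimeUnit.SECONDS)
                .maxConnectionIdle(60, TimeUnit.SECONDS)
                // .addService(new ChatServiceImpl()) // placeholder for the real chat service
                .build()
                .start();
        server.awaitTermination();
    }
}
```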
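The nightly cleanup amounts to something like the following sketch, assuming the server keeps a registry of the response observers it has handed out (the `ObserverRegistry` class and its names are placeholders; the real bookkeeping is not shown in this issue):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import io.grpc.stub.StreamObserver;

// Hypothetical registry of the active server-side response observers.
class ObserverRegistry {
    private final Map<String, StreamObserver<?>> observers = new ConcurrentHashMap<>();

    void register(String clientId, StreamObserver<?> observer) {
        observers.put(clientId, observer);
    }

    void remove(String clientId) {
        observers.remove(clientId);
    }

    // Called from a scheduled job at 3:00, while the chat service is idle.
    void completeAll() {
        observers.forEach((clientId, observer) -> {
            try {
                observer.onCompleted();
            } catch (RuntimeException e) {
                // The underlying stream may already be broken (e.g. after an LB RST_STREAM).
            }
        });
        observers.clear();
    }
}
```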
Some observations
- Couldn't find a memory leak with the options below: `-Dio.grpc.netty.shaded.io.netty.leakDetection.level=PARANOID -Dio.netty.leakDetection.level=PARANOID`
- Once the memory has grown it doesn't drop, even after the 3:00 `onCompleted()` event; I expected connection-related resources to be freed once the connections are idle.
- There was one moment when memory dropped, with the error below; I guess this is what triggered the memory release: `io.grpc.netty.shaded.io.netty.handler.codec.http2.Http2Exception$StreamException: Stream closed before write could take place`
- As I've tested:
  - After the client is connected and the stream is open from the server side, if I call `onCompleted()`, then after the keepalive period the server sends `OUTBOUND GO_AWAY`:
```
2021-04-07T20:36:18.051+09:00 DEBUG 35024 --- [ grpc-nio-worker-ELG-3-4] [ i.g.n.s.i.grpc.netty.NettyServerHandler: 214] : [id: 0x333386d4, L:/172.20.40.169:31105 - R:/172.20.40.169:50188] OUTBOUND GO_AWAY: lastStreamId=2147483647 errorCode=0 length=8 bytes=6d61785f69646c65
2021-04-07T20:36:18.051+09:00 DEBUG 35024 --- [ grpc-nio-worker-ELG-3-4] [ i.g.n.s.i.grpc.netty.NettyServerHandler: 214] : [id: 0x333386d4, L:/172.20.40.169:31105 - R:/172.20.40.169:50188] OUTBOUND PING: ack=false bytes=40715087873
2021-04-07T20:36:18.060+09:00 DEBUG 35024 --- [ grpc-nio-worker-ELG-3-4] [ i.g.n.s.i.grpc.netty.NettyServerHandler: 214] : [id: 0x333386d4, L:/172.20.40.169:31105 - R:/172.20.40.169:50188] INBOUND PING: ack=true bytes=40715087873
2021-04-07T20:36:18.061+09:00 DEBUG 35024 --- [ grpc-nio-worker-ELG-3-4] [ i.g.n.s.i.grpc.netty.NettyServerHandler: 214] : [id: 0x333386d4, L:/172.20.40.169:31105 - R:/172.20.40.169:50188] OUTBOUND GO_AWAY: lastStreamId=5 errorCode=0 length=8 bytes=6d61785f69646c65
```
  - However, when I open the connection and stream and then kill the client process abruptly, the server never sends `OUTBOUND GO_AWAY` (a cancellation-handler sketch follows this list).
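One thing that may be relevant here (an assumption, not something verified in this setup): when a client dies abruptly, the server only learns about the dead call through stream cancellation, so registering a cancel handler on the `ServerCallStreamObserver` is a way to drop the server-side reference without waiting for a write to fail. A minimal sketch:

```java
import io.grpc.stub.ServerCallStreamObserver;
import io.grpc.stub.StreamObserver;

class ChatStreamCleanup {
    // Registers a cancellation hook so the server drops its reference to the observer
    // as soon as gRPC notices the client is gone (RST_STREAM, dead TCP connection, ...).
    static <T> void dropOnCancel(StreamObserver<T> responseObserver, Runnable cleanup) {
        if (responseObserver instanceof ServerCallStreamObserver) {
            ((ServerCallStreamObserver<T>) responseObserver).setOnCancelHandler(cleanup);
        }
    }
}
```

In a server-streaming handler this would be called at the top of the method, e.g. `dropOnCancel(responseObserver, () -> registry.remove(clientId))`, where `registry` and `clientId` are placeholders for whatever bookkeeping the real service does.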
Here are my questions:
- Is it sane that the server does not send `OUTBOUND GO_AWAY` after the client process is killed abruptly? Do the resources get freed after the keepalive period, or after some other period of time?
- What could be the reason memory is not freed even though the service is not in use and all `StreamObserver`s have been `onCompleted()`?
- If it is all about unfreed broken-connection resources, will `maxConnectionAge` with `maxConnectionAgeGrace` help this situation? With these options I see `GO_AWAY` on normal connections, but I don't see any such log for abruptly disconnected connections. (A configuration sketch for these options follows this list.)
- I expect unreferenced `StreamObserver` objects and related resources in direct memory to be garbage collected. Should I explicitly call `onCompleted()` or `onError()` to release resources? I'm actually already calling `onError()` and `onCompleted()` on all `StreamObserver`s for termination, but I'm asking just in case.
- Is there any chance this is caused by the client's gRPC library?
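For the `maxConnectionAge` question, a minimal sketch of how those two options are set on the server builder (the values here are arbitrary, not a recommendation):

```java
import java.util.concurrent.TimeUnit;

import io.grpc.netty.shaded.io.grpc.netty.NettyServerBuilder;

class ConnectionAgeConfig {
    static NettyServerBuilder applyConnectionAge(NettyServerBuilder builder) {
        return builder
                // Forcibly cycle every connection after this age, so state tied to
                // broken connections cannot accumulate indefinitely.
                .maxConnectionAge(30, TimeUnit.MINUTES)
                // Grace period for in-flight RPCs before the connection is hard-closed.
                .maxConnectionAgeGrace(5, TimeUnit.MINUTES);
    }
}
```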
Figure: Chat service was active between 4/3 12:00 and 16:30 and the memory was never freed. Chat service resumed at 4/4 16:00. The memory growth trend looks exponential.
I'm sorry for not providing any source code here; if you need a reproduction I'll get one ready. The original source is in production, so I can't provide it at this moment.
Thanks in advance!
Also asked here: https://stackoverflow.com/q/67003312/5448419
Top GitHub Comments
Unfortunately I won’t have time to analyze your heap dumps but I can provide general pointers and attempt to answer your questions:
I don’t have much information to answer that. But I think it is worth trying.
Looking at https://netty.io/wiki/reference-counted-objects.html, it looks like the PARANOID level performs leak diagnostics for every single buffer, so it is going to impact performance; I just don't know how much.
Not necessarily, because the leak detector diagnostics should have kicked in and reported the leaks. You may also try calling `ResourceLeakDetector.setLevel()` to set the level from code.
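A rough sketch of setting the level from code, assuming the grpc-netty-shaded artifact (so the class lives under the shaded package):

```java
import io.grpc.netty.shaded.io.netty.util.ResourceLeakDetector;

class LeakDetectionSetup {
    static void enableParanoidLeakDetection() {
        // Equivalent to -Dio.grpc.netty.shaded.io.netty.leakDetection.level=PARANOID,
        // but set from code before any servers or channels are created.
        ResourceLeakDetector.setLevel(ResourceLeakDetector.Level.PARANOID);
    }
}
```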
Here is my 10-day memory usage with the `maxConnectionAge` configuration, which seems to show no leak. As I can't work through the hints provided above promptly, I will close this issue and get back when possible. Thanks!