Off-heap memory grows without bound using streaming API
Hi all, my grpc-java server's off-heap memory grows without bound.
gRPC version
- server: grpc-java 1.34.1
- client: grpc-java 1.34.1, gRPC-Swift 1.0.0
Other information
- Server-side streaming RPC for chat service.
- Connection over SSL; SSL offload is handled on the load balancer.
- Load is not so high
- Under 100 RPCs/sec per server.
- Under 10 RPCs/sec per client.
- I didn’t apply flow-control.
- Total client count is limited; it doesn't vary much over time.
- Server keepalive configurations (a builder-style configuration sketch follows this list):
  - keepalive-time: 30s
  - keepalive-timeout: 5s
  - max-connection-idle: 60s
- Every night at 3:00 I call `onCompleted()` on all the `StreamObserver`s on the server; the chat service is not in use at that moment (see the sweep sketch after this list).
- Clients randomly get `INTERNAL: RST_STREAM closed stream. HTTP/2 error code: INTERNAL_ERROR` from the LB, so I guess some broken connections are left on the server side.
  - I know this load balancer issue is unusual; it's under inspection.
- Mobile clients abruptly disconnect from the server for various reasons and then reconnect; I guess there are some broken connections left on the server side here, too.
- There may also be odd connection misuse on clients, e.g. unexpected multiple connections with open streams.
- I get `io.grpc.netty.shaded.io.netty.util.internal.OutOfDirectMemoryError` once memory reaches the limit. Memory is configured with `-Xmx1536m -XX:MaxDirectMemorySize=3584m`.
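For reference, those keepalive settings map roughly onto grpc-java's `NettyServerBuilder` as in the sketch below (assumptions: the port and the commented-out service registration are placeholders, not the actual production setup):

```java
import java.util.concurrent.TimeUnit;

import io.grpc.Server;
import io.grpc.netty.shaded.io.grpc.netty.NettyServerBuilder;

public class ChatServer {
    public static void main(String[] args) throws Exception {
        Server server = NettyServerBuilder.forPort(31105) // placeholder port
                // Same values as the keepalive settings listed above.
                .keepAliveTime(30, TimeUnit.SECONDS)
                .keepAliveTimeout(5, TimeUnit.SECONDS)
                .maxConnectionIdle(60, TimeUnit.SECONDS)
                // .addService(new ChatServiceImpl()) // placeholder for the real chat service
                .build()
                .start();
        server.awaitTermination();
    }
}
```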
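The nightly cleanup amounts to something like the following sketch, assuming the server keeps a registry of the response observers it has handed out (the `ObserverRegistry` class and its names are placeholders; the real bookkeeping is not shown in this issue):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import io.grpc.stub.StreamObserver;

// Hypothetical registry of the active server-side response observers.
class ObserverRegistry {
    private final Map<String, StreamObserver<?>> observers = new ConcurrentHashMap<>();

    void register(String clientId, StreamObserver<?> observer) {
        observers.put(clientId, observer);
    }

    void remove(String clientId) {
        observers.remove(clientId);
    }

    // Called from a scheduled job at 3:00, while the chat service is idle.
    void completeAll() {
        observers.forEach((clientId, observer) -> {
            try {
                observer.onCompleted();
            } catch (RuntimeException e) {
                // The underlying stream may already be broken (e.g. after an LB RST_STREAM).
            }
        });
        observers.clear();
    }
}
```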
Some observations
- Couldn't find a memory leak with the options below: `-Dio.grpc.netty.shaded.io.netty.leakDetection.level=PARANOID -Dio.netty.leakDetection.level=PARANOID`
- Once the memory has grown it doesn't drop, even after the 3:00 `onCompleted()` event; I expected connection-related resources to be freed once the connections are idle.
- There was one moment when memory dropped, with the error below; I guess this is what triggered the memory release: `io.grpc.netty.shaded.io.netty.handler.codec.http2.Http2Exception$StreamException: Stream closed before write could take place`
- As I've tested:
  - After the client is connected and the stream is open from the server side, if I call `onCompleted()`, then after the keepalive period the server sends `OUTBOUND GO_AWAY`:
```
2021-04-07T20:36:18.051+09:00 DEBUG 35024 --- [ grpc-nio-worker-ELG-3-4] [ i.g.n.s.i.grpc.netty.NettyServerHandler: 214] : [id: 0x333386d4, L:/172.20.40.169:31105 - R:/172.20.40.169:50188] OUTBOUND GO_AWAY: lastStreamId=2147483647 errorCode=0 length=8 bytes=6d61785f69646c65
2021-04-07T20:36:18.051+09:00 DEBUG 35024 --- [ grpc-nio-worker-ELG-3-4] [ i.g.n.s.i.grpc.netty.NettyServerHandler: 214] : [id: 0x333386d4, L:/172.20.40.169:31105 - R:/172.20.40.169:50188] OUTBOUND PING: ack=false bytes=40715087873
2021-04-07T20:36:18.060+09:00 DEBUG 35024 --- [ grpc-nio-worker-ELG-3-4] [ i.g.n.s.i.grpc.netty.NettyServerHandler: 214] : [id: 0x333386d4, L:/172.20.40.169:31105 - R:/172.20.40.169:50188] INBOUND PING: ack=true bytes=40715087873
2021-04-07T20:36:18.061+09:00 DEBUG 35024 --- [ grpc-nio-worker-ELG-3-4] [ i.g.n.s.i.grpc.netty.NettyServerHandler: 214] : [id: 0x333386d4, L:/172.20.40.169:31105 - R:/172.20.40.169:50188] OUTBOUND GO_AWAY: lastStreamId=5 errorCode=0 length=8 bytes=6d61785f69646c65
```
  - However, when I open the connection and stream and then kill the client process abruptly, the server never sends `OUTBOUND GO_AWAY` (a cancellation-handler sketch follows this list).
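One thing that may be relevant here (an assumption, not something verified in this setup): when a client dies abruptly, the server only learns about the dead call through stream cancellation, so registering a cancel handler on the `ServerCallStreamObserver` is a way to drop the server-side reference without waiting for a write to fail. A minimal sketch:

```java
import io.grpc.stub.ServerCallStreamObserver;
import io.grpc.stub.StreamObserver;

class ChatStreamCleanup {
    // Registers a cancellation hook so the server drops its reference to the observer
    // as soon as gRPC notices the client is gone (RST_STREAM, dead TCP connection, ...).
    static <T> void dropOnCancel(StreamObserver<T> responseObserver, Runnable cleanup) {
        if (responseObserver instanceof ServerCallStreamObserver) {
            ((ServerCallStreamObserver<T>) responseObserver).setOnCancelHandler(cleanup);
        }
    }
}
```

In a server-streaming handler this would be called at the top of the method, e.g. `dropOnCancel(responseObserver, () -> registry.remove(clientId))`, where `registry` and `clientId` are placeholders for whatever bookkeeping the real service does.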
Here are my questions:
- Is it sane that the server does not send `OUTBOUND GO_AWAY` after the client process is killed abruptly? Do the resources get freed after the keepalive period, or after some other period of time?
- What could be the reason memory is not freed even though the service is not in use and all `StreamObserver`s have been `onCompleted()`?
- If it is all about unfreed broken-connection resources, will `maxConnectionAge` with `maxConnectionAgeGrace` help this situation? With these options I see `GO_AWAY` on normal connections, but I don't see any such log for abruptly disconnected connections. (A configuration sketch for these options follows this list.)
- I expect unreferenced `StreamObserver` objects and related resources in direct memory to be garbage collected. Should I explicitly call `onCompleted()` or `onError()` to release resources? I'm actually already calling `onError()` and `onCompleted()` on all `StreamObserver`s for termination, but I'm asking just in case.
- Is there any chance this is caused by the client's gRPC library?
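For the `maxConnectionAge` question, a minimal sketch of how those two options are set on the server builder (the values here are arbitrary, not a recommendation):

```java
import java.util.concurrent.TimeUnit;

import io.grpc.netty.shaded.io.grpc.netty.NettyServerBuilder;

class ConnectionAgeConfig {
    static NettyServerBuilder applyConnectionAge(NettyServerBuilder builder) {
        return builder
                // Forcibly cycle every connection after this age, so state tied to
                // broken connections cannot accumulate indefinitely.
                .maxConnectionAge(30, TimeUnit.MINUTES)
                // Grace period for in-flight RPCs before the connection is hard-closed.
                .maxConnectionAgeGrace(5, TimeUnit.MINUTES);
    }
}
```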
Figure: Chat service was active between 4/3 12:00 and 16:30 and the memory was never freed. Chat service resumed at 4/4 16:00. The memory growth trend looks exponential.
I'm sorry for not providing any source code here; if you need a reproduction I'll get one ready. The original source is in production, so I can't provide it at this moment.
Thanks in advance!
Also asked here: https://stackoverflow.com/q/67003312/5448419
Top GitHub Comments
Unfortunately I won’t have time to analyze your heap dumps but I can provide general pointers and attempt to answer your questions:
I don’t have much information to answer that. But I think it is worth trying.
Looking at https://netty.io/wiki/reference-counted-objects.html, it looks like the PARANOID level performs leak diagnostics for every single buffer, so it is going to impact performance; I just don't know how much.
Not necessarily, because the leak detector diagnostics should have kicked in and reported the leaks. You may also try calling `ResourceLeakDetector.setLevel()` to set the level from code.
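A rough sketch of setting the level from code, assuming the grpc-netty-shaded artifact (so the class lives under the shaded package):

```java
import io.grpc.netty.shaded.io.netty.util.ResourceLeakDetector;

class LeakDetectionSetup {
    static void enableParanoidLeakDetection() {
        // Equivalent to -Dio.grpc.netty.shaded.io.netty.leakDetection.level=PARANOID,
        // but set from code before any servers or channels are created.
        ResourceLeakDetector.setLevel(ResourceLeakDetector.Level.PARANOID);
    }
}
```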
Here is my 10-day memory usage with the `maxConnectionAge` configuration, which seems to show no leak. As I can't work through the hints provided above promptly, I will close this issue and get back when possible. Thanks!