Off-heap memory grows without bound using streaming api

Hi all, my grpc-java server’s off-heap memory grows without bound.

gRPC version

  • server: grpc-java 1.34.1
  • client: grpc-java 1.34.1, gRPC-Swift 1.0.0

Other information

  • Server-side streaming RPC for chat service.
  • Connections are over SSL; SSL offload is handled on the load balancer.
  • Load is not very high.
    • Under 100 RPCs/sec per server.
    • Under 10 RPCs/sec per client.
    • I didn’t apply flow control.
  • The total client count is limited; it doesn’t vary much over time.
  • Server keepalive configuration (see the sketch after this list):
    • keepalive-time: 30s
    • keepalive-timeout: 5s
    • max-connection-idle: 60s
  • Every night at 3:00 I call onComplete() on all the StreamObservers on the server; the chat service is not in use at that moment.
  • Clients randomly get INTERNAL: RST_STREAM closed stream. HTTP/2 error code: INTERNAL_ERROR from the LB, so I guess some broken connections are left on the server side.
    • I know this load balancer issue is unusual; it’s under investigation.
  • Mobile clients abruptly disconnect from the server for various reasons and then reconnect; I guess this also leaves some broken connections on the server side.
    • There may also be odd connection misuse on clients, e.g. unexpected multiple connections with open streams.
  • I get io.grpc.netty.shaded.io.netty.util.internal.OutOfDirectMemoryError once memory reaches the limit. Memory is configured as follows:
    • -Xmx1536m
    • -XX:MaxDirectMemorySize=3584m
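
For reference, here is a minimal sketch of how these keepalive values map onto NettyServerBuilder. This is not the production code (which I can’t share); the shaded builder package is assumed because the logs below show shaded Netty class names, the port is taken from those logs, and the service registration is left as a commented placeholder since the real service implementation isn’t shown.

    import java.util.concurrent.TimeUnit;

    import io.grpc.Server;
    import io.grpc.netty.shaded.io.grpc.netty.NettyServerBuilder;

    public class ChatServer {
        public static void main(String[] args) throws Exception {
            Server server = NettyServerBuilder.forPort(31105)     // port taken from the logs below
                    .keepAliveTime(30, TimeUnit.SECONDS)          // keepalive-time: 30s
                    .keepAliveTimeout(5, TimeUnit.SECONDS)        // keepalive-timeout: 5s
                    .maxConnectionIdle(60, TimeUnit.SECONDS)      // max-connection-idle: 60s
                    // .addService(new ChatServiceImpl())         // hypothetical streaming service impl
                    .build()
                    .start();
            server.awaitTermination();
        }
    }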

Some observations

  • I couldn’t find a memory leak with the options below:
    • -Dio.grpc.netty.shaded.io.netty.leakDetection.level=PARANOID -Dio.netty.leakDetection.level=PARANOID
  • Once the memory has grown it doesn’t drop, even after the 3:00 onComplete() event (see the sweep sketch after this list); I expected connection-related resources to be freed once the connections go idle.
  • There was one moment when memory dropped, with the error below; I guess this is what triggered the release.
    • io.grpc.netty.shaded.io.netty.handler.codec.http2.Http2Exception$StreamException: Stream closed before write could take place
  • From my testing:
    • After a client connects and the stream is opened from the server side, when I call onComplete() the server sends OUTBOUND GO_AWAY after the keepalive period:
    2021-04-07T20:36:18.051+09:00 DEBUG 35024 --- [       grpc-nio-worker-ELG-3-4] [ i.g.n.s.i.grpc.netty.NettyServerHandler: 214] : [id: 0x333386d4, L:/172.20.40.169:31105 - R:/172.20.40.169:50188] OUTBOUND GO_AWAY: lastStreamId=2147483647 errorCode=0 length=8 bytes=6d61785f69646c65
    2021-04-07T20:36:18.051+09:00 DEBUG 35024 --- [       grpc-nio-worker-ELG-3-4] [ i.g.n.s.i.grpc.netty.NettyServerHandler: 214] : [id: 0x333386d4, L:/172.20.40.169:31105 - R:/172.20.40.169:50188] OUTBOUND PING: ack=false bytes=40715087873
    2021-04-07T20:36:18.060+09:00 DEBUG 35024 --- [       grpc-nio-worker-ELG-3-4] [ i.g.n.s.i.grpc.netty.NettyServerHandler: 214] : [id: 0x333386d4, L:/172.20.40.169:31105 - R:/172.20.40.169:50188] INBOUND PING: ack=true bytes=40715087873
    2021-04-07T20:36:18.061+09:00 DEBUG 35024 --- [       grpc-nio-worker-ELG-3-4] [ i.g.n.s.i.grpc.netty.NettyServerHandler: 214] : [id: 0x333386d4, L:/172.20.40.169:31105 - R:/172.20.40.169:50188] OUTBOUND GO_AWAY: lastStreamId=5 errorCode=0 length=8 bytes=6d61785f69646c65
    
    • However, when I open the connection and stream and then kill the client process abruptly, the server never sends OUTBOUND GO_AWAY.
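
Since the 3:00 onComplete() sweep comes up a couple of times above, here is a minimal sketch of the general idea, not the production code: server-side response observers are tracked in a concurrent set and completed by a scheduled task. All names are hypothetical, and the “run at 3:00 every night” scheduling is replaced with a fixed daily period purely for illustration.

    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    import io.grpc.stub.StreamObserver;

    public class ObserverRegistry {
        private final Set<StreamObserver<?>> observers = ConcurrentHashMap.newKeySet();
        private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

        // Called when a server-streaming RPC starts.
        public void register(StreamObserver<?> observer) {
            observers.add(observer);
        }

        // Called when an RPC ends normally or breaks.
        public void unregister(StreamObserver<?> observer) {
            observers.remove(observer);
        }

        // Run the sweep once a day after the given initial delay (illustrative scheduling only).
        public void startNightlySweep(long initialDelayMinutes) {
            scheduler.scheduleAtFixedRate(this::completeAll,
                    initialDelayMinutes, TimeUnit.DAYS.toMinutes(1), TimeUnit.MINUTES);
        }

        private void completeAll() {
            for (StreamObserver<?> observer : observers) {
                try {
                    observer.onCompleted();   // closes the call from the server side with OK status
                } catch (RuntimeException e) {
                    // The stream may already be broken (e.g. the client died abruptly); ignore.
                } finally {
                    observers.remove(observer);
                }
            }
        }
    }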

Here are my questions:

  1. Is it expected that the server sends no OUTBOUND GO_AWAY after the client process is killed abruptly? Do the resources get freed after the keepalive period, or after some other period of time?
  2. What could be the reason memory is not freed even though the service is not in use and onCompleted() has been called on all StreamObservers?
  3. If this is all about unfreed resources from broken connections, will maxConnectionAge with maxConnectionAgeGrace help? With these options I see GO_AWAY for normal connections, but I don’t see any log for abruptly disconnected ones. (A configuration sketch follows this list.)
  4. I expect unreferenced StreamObserver objects and the related resources in direct memory to be garbage collected. Should I explicitly call onCompleted() or onError() for resource release? I actually do call onError() and onCompleted() on all StreamObservers for termination, but I’m asking just in case.
  5. Is there any chance this is caused by the client’s gRPC library?
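
For question 3, what I have in mind is roughly the sketch below; the values are illustrative, this is not a confirmed fix, and it assumes the same imports and shaded NettyServerBuilder as the keepalive sketch earlier.

    NettyServerBuilder.forPort(31105)
            .keepAliveTime(30, TimeUnit.SECONDS)
            .keepAliveTimeout(5, TimeUnit.SECONDS)
            .maxConnectionIdle(60, TimeUnit.SECONDS)
            .maxConnectionAge(30, TimeUnit.MINUTES)      // send GO_AWAY on long-lived connections after roughly 30 minutes
            .maxConnectionAgeGrace(5, TimeUnit.MINUTES)  // then allow up to 5 more minutes for in-flight RPCs
            .build()
            .start();

For question 4, I’m also aware of ServerCallStreamObserver.setOnCancelHandler(), which runs when a client cancels or its transport breaks, so tracked observers could in principle be released even when onCompleted()/onError() is never reached.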

Figure: The chat service was active between 4/3 12:00 and 16:30 and the memory was never freed. The chat service resumed at 4/4 16:00. The memory growth trend looks exponential.

I’m sorry for not providing any source code here; if you need a reproducible example I’ll get one ready. The original source is in production, so I can’t share it at the moment.

Thanks in advance!

Also asked here: https://stackoverflow.com/q/67003312/5448419

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
sanjaypujare commented on Apr 20, 2021

Unfortunately I won’t have time to analyze your heap dumps, but I can provide general pointers and attempt to answer your questions:

Would there be a possibility of getting a different result from the production server?

I don’t have much information to answer that. But I think it is worth trying.

Does turning the option on have a big impact on performance?

Looking at https://netty.io/wiki/reference-counted-objects.html, it looks like the PARANOID level performs leak diagnostics for every single buffer, so it is going to impact performance; I just don’t know how much.

Does this mean there’s a leak?

Not necessarily, because the leak detector diagnostic should have kicked in and reported the leaks. You may also try calling ResourceLeakDetector.setLevel() to set the level programmatically.
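
In code, that call looks roughly like the sketch below; the shaded package name is an assumption based on the shaded class names in your logs, and where you invoke it (e.g. before server startup) is up to you.

    import io.grpc.netty.shaded.io.netty.util.ResourceLeakDetector;

    public class LeakDetectionSetup {
        public static void enableParanoidLeakDetection() {
            // PARANOID tracks every allocated buffer, so expect a noticeable performance cost.
            ResourceLeakDetector.setLevel(ResourceLeakDetector.Level.PARANOID);
        }
    }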

0 reactions
gunjasal commented on Apr 26, 2021

I’m leaving my 10-day memory usage chart with the maxConnectionAge configuration, which shows no apparent leak, since I can’t follow up on the hints above promptly. I’ll close this issue and get back when possible.

Thanks!

Figure: 10-day memory usage with the maxConnectionAge configuration.