Native image memory leaks?
Describe the bug
Hi Quarkus team & community!
While taking another look at recent developments in the Quarkus space, and after starting a recently published book, I initially played with the native image by repeatedly pressing F5 (browser reload) on a simple GET /accounts endpoint and noticed that memory usage kept increasing.
So, being genuinely curious about it, I started to do it in a more “standard” way, but one simple enough to illustrate the point. See the How to Reproduce section for all the figures.
The app is a bare-bones service that exposes an API and returns some data from memory; no disk or database is involved. The complete example can be found here.
The “stress test” is done using cURL; it is this /tests/account_svc_local_stress_1.sh file. In all the screenshots below you’ll see there is no bottleneck anywhere else (neither in the script nor in the system).
Obviously, a native image is not as efficient as a JVM-based running app, but the concerning part is that memory usage keeps increasing even though there is no real reason for it, short of a memory leak in the framework internals or the JAX-RS implementation. And most probably the only workaround for now would be to recycle the instances from time to time.
Expected behavior
No response
Actual behavior
No response
How to Reproduce?
Initially:
$ ./target/account_svc-1.0.0-SNAPSHOT-runner
__ ____ __ _____ ___ __ ____ ______
--/ __ \/ / / / _ | / _ \/ //_/ / / / __/
-/ /_/ / /_/ / __ |/ , _/ ,< / /_/ /\ \
--\___\_\____/_/ |_/_/|_/_/|_|\____/___/
2022-01-09 15:08:06,648 INFO [io.quarkus] (main) account_svc 1.0.0-SNAPSHOT native (powered by Quarkus 2.6.1.Final) started in 0.034s. Listening on: http://0.0.0.0:8080
2022-01-09 15:08:06,648 INFO [io.quarkus] (main) Profile prod activated.
2022-01-09 15:08:06,648 INFO [io.quarkus] (main) Installed features: [cdi, kubernetes, resteasy, resteasy-jsonb, smallrye-context-propagation, vertx]
It starts quickly and memory usage is very low. Of course, that’s not very relevant in the grand scheme of things, but it’s the “starting point”.
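For completeness, the runner binary above is the product of a standard Quarkus native build; a minimal sketch, assuming the Maven wrapper that ships with the reproducer project and a local GraalVM or Mandrel installation:

```shell
# Build the native executable (the -Pnative profile is the standard
# Quarkus way to trigger a native-image build).
./mvnw package -Pnative

# Then start the resulting binary:
./target/account_svc-1.0.0-SNAPSHOT-runner
```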
After running a first test using ./account_svc_local_stress_1.sh -a http://localhost:8080/accounts
the result is:
After running ./account_svc_local_stress_1.sh -r 10000 -a http://localhost:8080/accounts
Continue with another, longer test by sending 1 million requests using
./account_svc_local_stress_1.sh -r 1000000 -a http://localhost:8080/accounts.
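The stress script itself isn’t reproduced in this issue, but its core is essentially a cURL loop. A minimal sketch, assuming the same -r (request count) and -a (address) semantics as the real script; the stress function name and defaults here are illustrative, not taken from the actual file:

```shell
#!/bin/sh
# Minimal sketch of the request loop; the real
# /tests/account_svc_local_stress_1.sh is more elaborate (option parsing, timing).
stress() {
    requests="$1"   # e.g. 10000, matching the -r option above
    url="$2"        # e.g. http://localhost:8080/accounts, matching -a
    i=1
    while [ "$i" -le "$requests" ]; do
        # -s silences progress output; -o /dev/null discards the body.
        curl -s -o /dev/null "$url"
        i=$((i + 1))
    done
}

# Example: stress 10000 http://localhost:8080/accounts
```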
During this one, the CPU usage occasionally increases for short durations (around 3 seconds) from ~20% to ~26%, while the memory still increases in very small but fairly consistent steps.
And the result is:
Running the same 1M-request test again, the result is:
Output of uname -a or ver
Linux dxps 5.15.8-76051508-generic #202112141040~1639505278~21.10~0ede46a SMP Tue Dec 14 22:38:29 U x86_64 x86_64 x86_64 GNU/Linux
Issue Analytics
- State:
- Created 2 years ago
- Comments: 12 (11 by maintainers)
Top GitHub Comments
With native image, there is only a limited choice of GCs: Serial GC (default), Epsilon GC (no-op), and G1 GC (GraalVM EE only). From the setup:
- -Xmn config -> young generation size is 256M
- -Xmx config -> 80% of the physical memory, which is very large

Based on the usage pattern, most of the objects are short-lived. The increase in memory size is caused by object allocations in the eden region of the young generation before it reaches its limit (256M). After turning on the GC output (-XX:+PrintGC -XX:+VerboseGC), I can observe that when the young generation is full, a young collection reclaims most of the space. After the young generation reaches its limit, the overall memory size won’t increase much. When the young generation is made smaller, e.g. with -Xmn32m, the overall memory footprint is also reduced. So this is unlikely to be a memory leak.

@galderz Okay, okay, I explicitly mentioned that I used Java 17 instead of Java 11. Should I almost feel guilty of pretending to compare apples to apples? 😐 I’m not here to bash Quarkus; I even promoted it internally to my employer, and it has been in use there for around 1.5 years now. Initially, I just wanted to raise a heads-up about this issue here and see whether the community has any feedback. Of course, I could use Eclipse MAT or something similar to analyze a heap dump taken during or after a test. I’ll try to find and spend some time understanding the why.
@alexcheng1982 Excellent insight! Thanks for sharing it!
Started another “primitive” test scenario (previously described) in JVM mode using Java 11, the same major version as the one used to build the initial native image. The Java version being used is 11.0.13+8-Ubuntu-0ubuntu1.21.10.

At fresh startup time (without any requests), it looks like this:

- 1st set of 1M requests: after 4-7 seconds, the memory usage quickly jumps up to 447.7 MB and stays there for the rest of the test (which normally takes 6m 30s to complete).
- 2nd set of 1M requests: after 4-7 seconds, the memory usage jumps up to 461.4 MB and more or less stays there for the rest of the test.
- 3rd set of 1M requests: after 4-7 seconds, the memory usage jumps up to 467.6 MB and more or less stays there for the rest of the test.

I could continue, but I guess the pattern is predictable.
Now just taking another shot with the native image and an explicit -Xmx, as rightly suggested by @gsmet and @alexcheng1982. Used ./target/account_svc-1.0.0-SNAPSHOT-runner -Xmx64m.

At startup, memory usage is 5.1 MB; during the test, it stays in the 11 - 16 MB range.

Without any other relevant actions, we could consider this issue closed and take the lessons learned from it. Thanks again for the valuable feedback! 🙏