Caffeine cache causing memory leak and OOM
Description
Recently, we updated our services to Java 17 (v17.0.1), which also forced us to use the New Relic Java agent version 7.4.0.
After the migration, we noticed our services regularly crashing due to OOM. At first we suspected some issue with Java 17 itself, but after obtaining several heap dumps it looks like a problem with the Caffeine caching introduced in 7.3.0 (#258); we migrated from an older version of the agent.
We believe it might be related to the MariaDB driver we are using. In one of the heap dumps, the com.newrelic.agent.deps.com.github.benmanes.caffeine.cache.WS map has 2.1 million entries of prepared statements. That would fit, since the cache relies on the mariadb-java-client implementation of PreparedStatement, which does not implement equals/hashCode.
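To make the suspicion concrete, here is a minimal, hypothetical sketch of the suspected pattern. FakePreparedStatement is an invented stand-in, not the actual driver class, and the cache configuration is only an approximation of what the agent builds internally: keys that do not implement equals/hashCode compare by identity, so a cache keyed on statement objects gains one entry per prepared statement, and those entries only disappear once the keys are garbage collected and Caffeine's maintenance work actually runs.

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

public class IdentityKeyGrowth {

    // Invented stand-in for a driver class that does not override
    // equals/hashCode, so every instance is a distinct key.
    static final class FakePreparedStatement {
        final String sql;
        FakePreparedStatement(String sql) { this.sql = sql; }
    }

    public static void main(String[] args) {
        Cache<FakePreparedStatement, String> cache = Caffeine.newBuilder()
                .weakKeys()   // weak keys also force identity (==) comparison
                .build();

        // Preparing the same SQL text repeatedly still yields one entry per
        // statement object, because no two keys ever compare equal.
        for (int i = 0; i < 1_000; i++) {
            cache.put(new FakePreparedStatement("SELECT 1"), "statement metadata");
        }
        System.out.println("estimated size: " + cache.estimatedSize());
        // Entries are only removed after the keys become unreachable *and*
        // the cache's maintenance runs (by default on ForkJoinPool.commonPool()).
    }
}
```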
After taking two more heap dumps, one right after startup of the app and one an hour later, this is the comparison:
It looks like the cache contained only PreparedStatement entries.
Expected Behavior
Agent and service running smoothly, no memory leak.
Troubleshooting or NR Diag results
Steps to Reproduce
Your Environment
NewRelic agent: 7.4.0
Java: eclipse-temurin:17 docker image (currently 17.0.1)
Spring Boot: 2.5.6
MariaDB driver: 2.7.4 (managed by Spring Boot BOM)
Additional context
Caffeine’s fix is released in v2.9.3 and v3.0.5. This will recover if the executor malfunctions, but affected users should apply a workaround for the JDK bug.
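One mitigation discussed later in this thread is to have Caffeine run its maintenance work on the calling thread instead of ForkJoinPool.commonPool(). A minimal sketch of such a configuration, with placeholder key/value types rather than the agent's actual cache definitions:

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

public final class SameThreadMaintenance {
    public static void main(String[] args) {
        Cache<Object, Object> cache = Caffeine.newBuilder()
                .weakKeys()
                // Run eviction/cleanup work on the calling thread instead of
                // ForkJoinPool.commonPool(), sidestepping the common-pool issue.
                .executor(Runnable::run)
                .build();

        cache.put(new Object(), "value");
        System.out.println(cache.estimatedSize());
    }
}
```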
Today the application used less memory, though there still was a slight upward trend. I created two heap dumps; in the meantime the application was restarted for unrelated reasons. Key findings:
- The WS instance referenced by ExtensionHolderFactoryImpl.ExtensionHolderImpl.instanceCache that contains the prepared statements is now small (just 267776 and 37752 bytes retained, respectively), with keyReferenceQueue.queueLength == 0, as it should be. 🎉
- All other ExtensionHolderFactoryImpl.ExtensionHolderImpl.instanceCache instances are now also small (on the order of kilobytes). I went back to yesterday’s heap dump and saw that a few other such instances were also larger back then (on the order of megabytes).
- All in all: whatever the issue/bug is, Caffeine.executor(Runnable::run) avoids it. ✔️
- Two other WS instances that take up a significant amount of memory remain:
  - JdbcHelper.connectionToURL, comprising 22.17% / 8.12% of the heap, with a keyReferenceQueue.queueLength of 616446 / 102386.
  - JdbcHelper.connectionToIdentifier, comprising 18.47% / 5.74% of the heap, with a keyReferenceQueue.queueLength of 501032 / 76343.
- IIUC these caches are created by AgentCollectionFactory.createConcurrentWeakKeyedMap, for which we did not set Caffeine.executor(Runnable::run) (see the sketch below). It’s safe to say that these two caches explain why memory usage was still trending upwards today.
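Purely as an illustration of the change being discussed, here is a sketch of a weak-keyed map factory with the same-thread executor applied; the real AgentCollectionFactory.createConcurrentWeakKeyedMap may have a different signature and body:

```java
import java.util.concurrent.ConcurrentMap;

import com.github.benmanes.caffeine.cache.Caffeine;

public final class WeakKeyedMaps {

    // Illustrative only: a weak-keyed, Caffeine-backed map that runs its
    // maintenance on the calling thread instead of ForkJoinPool.commonPool().
    public static <K, V> ConcurrentMap<K, V> createConcurrentWeakKeyedMap() {
        return Caffeine.newBuilder()
                .weakKeys()
                .executor(Runnable::run)
                .<K, V>build()
                .asMap();
    }
}
```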
As previously discussed, the next step would now be to test whether the issue is specific to Java 17 by downgrading to (e.g.) Java 16. Unfortunately my schedule today was even busier than expected, and tomorrow will certainly be no better. So instead I have started a different experiment that took me a bit less time to set up: I have now updated the New Relic Agent to use Executors.newCachedThreadPool() rather than Runnable::run (this time also for the cache in AgentCollectionFactory). This experiment should hopefully tell us whether (a) the issue is somehow related to any kind of scheduling/thread switching or (b) it is more likely to be specific to ForkJoinPool.commonPool().
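For clarity, a rough sketch of the three executor configurations this experiment distinguishes between; the builder calls are shown in isolation, not as the agent actually wires them up:

```java
import java.util.concurrent.Executors;

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

public final class ExecutorVariants {
    public static void main(String[] args) {
        // (a) same-thread maintenance: no hand-off to another thread at all
        Cache<Object, Object> sameThread = Caffeine.newBuilder()
                .weakKeys()
                .executor(Runnable::run)
                .build();

        // (b) a dedicated cached thread pool: asynchronous maintenance,
        //     but independent of ForkJoinPool.commonPool()
        Cache<Object, Object> cachedPool = Caffeine.newBuilder()
                .weakKeys()
                .executor(Executors.newCachedThreadPool())
                .build();

        // (c) Caffeine's default: maintenance on ForkJoinPool.commonPool()
        Cache<Object, Object> defaultPool = Caffeine.newBuilder()
                .weakKeys()
                .build();
    }
}
```

If the cached-thread-pool variant also keeps memory flat, that would point at ForkJoinPool.commonPool() specifically rather than at asynchronous maintenance in general.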
Stay tuned 😃