OutOfMemoryError spins library out of control
Issue summary
nonapi.io.github.classgraph.fastzipfilereader.NestedJarHandler#close may never complete if the thread was interrupted, which is always the case when scanning has failed. See below for details.
Background
I was writing some unit tests for a really basic Neo4J graph, but while Neo4J-OGM was looking for my Entities during test setup, it just never completed. With some very roundabout debugging (all the async and OOM interplay was making me tear my hair out), I managed to narrow down the root cause to how classgraph works. I have 32G memory, and Java defaults to using 1/4th of memory per process, which means I cannot run much in parallel, so I have a global 256M override for each process. It works fine for Gradle, Neo4J, and other traditionally resource-hungry setups. It even works for my test setup most of the time, so for this report repro I lowered the memory limit to be consistent.
Repro Setup
trivial library usage:
// implementation "io.github.classgraph:classgraph:4.8.62"
ScanResult result = new ClassGraph()
.whitelistPackages("pack.age")
.verbose()
.scan()
;
plus a dependency with a nested jar in it:
runtimeOnly "org.neo4j:neo4j-lucene-upgrade:3.4.9"
non-trivial: resource constrained memory: -Xmx128M
(see build.gradle)
Full minimal repro can be found here: https://github.com/TWiStErRob/repros/tree/master/classgraph/oom-NestedJarHandler-spin
Repro steps
gradlew run
or
- Import project to IntelliJ
- Execute Main.main (should pass, depending on machine setup)
- Edit Run Configuration to have VM options: -Xmx128M
- Run or debug again (will fail, expected: print a line)
Pieces of the puzzle
What I found to be contributing to the issue.
Nested Jar files
runtimeOnly "org.neo4j:neo4j-lucene-upgrade:3.4.9"
is an example from Neo4J/OGM; it contains !/lib/lucene-backward-codecs-5.5.0.jar. A nested JAR like this is required to trigger the following code path in NestedJarHandler:
    } else {
        // This path has one or more '!' sections.
Large allocation to trigger OOM
This is baked into the library. MappedByteBufferResources allocates 64M per JAR file, which is quite large: there could be many JARs like this on the classpath, and many allocations could happen at the same time, consuming a lot of memory. Note: I have a 14-core/28-logical-core processor, and the library runs on 39 threads. So if all of those are processing nested JARs, about 2.5G of memory is required. That doesn’t sound “ultra-lightweight” 😃
This allocation throws an OOM, which is caught by AutoCloseableExecutorService.afterExecute, and the thread is interrupted. I think this interrupt doesn’t hurt NestedJarHandler.close, so it could be an irrelevant red herring. But this location came up during the investigation, so I thought I would mention it.
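The arithmetic behind the 2.5G estimate can be checked with a few lines of Java. This is only a back-of-the-envelope sketch, not ClassGraph code: the class and method names are made up for illustration, and it assumes each thread holds exactly one 64M buffer at the same moment and that the buffers count against the heap limit.

```java
// Back-of-the-envelope check of the numbers above; not ClassGraph code.
final class WorstCaseBufferMemory {
    /** Bytes needed if every worker thread holds one buffer at the same time. */
    static long worstCaseBytes(int threads, long bytesPerBuffer) {
        return Math.multiplyExact((long) threads, bytesPerBuffer);
    }

    public static void main(String[] args) {
        long perJar = 64L * 1024 * 1024;          // 64M per nested JAR, as reported above
        long total = worstCaseBytes(39, perJar);  // 39 scanner threads, as reported above
        long heap = 128L * 1024 * 1024;           // -Xmx128M from the repro setup

        System.out.println("worst case in MiB: " + (total >> 20));
        // Upper bound only: the rest of the application needs heap as well.
        System.out.println("buffers that fit into the heap at once: " + (heap / perJar));
    }
}
```

With 39 threads this comes to 2496 MiB, which is where the rounded 2.5G figure comes from, while a 128M heap has room for at most two such buffers.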
Scanner.call catches and cleans up
When an exception occurs (e.g. our OOM), removeTemporaryFilesAfterScan is set to true and threads are interrupted. The finally block then tries to close nestedJarHandler. The relevant parts of the method:
@Override
public ScanResult call() throws InterruptedException, CancellationException, ExecutionException {
    try {
        scanResult = openClasspathElementsThenScan();
    } catch (final Throwable e) {
        // Since an exception was thrown, remove temporary files
        removeTemporaryFilesAfterScan = true;
        // Stop any running threads (should not be needed, threads should already be quiescent)
        interruptionChecker.interrupt();
    } finally {
        if (removeTemporaryFilesAfterScan) {
            // If removeTemporaryFilesAfterScan was set, remove temp files and close resources,
            // zipfiles and modules
            nestedJarHandler.close(topLevelLog);
        }
    }
    return scanResult;
}
NestedJarHandler#close spins
See numbered inline comments for explanation.
if (canonicalFileToPhysicalZipFileMap != null) {
    // (4) spins out of control, because (3)
    while (!canonicalFileToPhysicalZipFileMap.isEmpty()) {
        try {
            // (1) throws InterruptedException if thread was interrupted, which is always because Scanner error handling interrupted
            for (final Entry<File, PhysicalZipFile> ent : canonicalFileToPhysicalZipFileMap.entries()) {
                final PhysicalZipFile physicalZipFile = ent.getValue();
                physicalZipFile.close();
                // (3) never executed, because of ConcurrentHashMap's interruption in SingletonMap
                canonicalFileToPhysicalZipFileMap.remove(ent.getKey());
            }
        } catch (final InterruptedException e) {
            // (2) re-interrupts Thread
            interruptionChecker.interrupt();
        }
    }
    canonicalFileToPhysicalZipFileMap = null;
}
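The same loop shape can be reproduced without ClassGraph. The sketch below is a stand-alone illustration under assumptions: the entries helper is a hypothetical stand-in for a map accessor that throws InterruptedException on an interrupted thread, and Thread.currentThread().interrupt() stands in for interruptionChecker.interrupt(). A maxAttempts bound is added so that the demo terminates, whereas the loop above has no such bound.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class InterruptedCloseSpin {
    /** Hypothetical stand-in for a map accessor that refuses to run on an interrupted thread. */
    static Iterable<Map.Entry<String, String>> entries(Map<String, String> map)
            throws InterruptedException {
        if (Thread.interrupted()) { // checks AND clears the interrupt flag
            throw new InterruptedException();
        }
        return map.entrySet();
    }

    /** Same shape as the close loop above, but gives up after maxAttempts. Returns attempts used. */
    static int drain(Map<String, String> map, int maxAttempts) {
        int attempts = 0;
        while (!map.isEmpty() && attempts < maxAttempts) {
            attempts++;
            try {
                for (Map.Entry<String, String> ent : entries(map)) {
                    map.remove(ent.getKey());
                }
            } catch (InterruptedException e) {
                // Re-interrupting guarantees that the next attempt fails the same way.
                Thread.currentThread().interrupt();
            }
        }
        return attempts;
    }

    public static void main(String[] args) {
        Map<String, String> map = new ConcurrentHashMap<>();
        map.put("outer.jar", "zipfile");
        Thread.currentThread().interrupt(); // what the error handling did before close()
        int attempts = drain(map, 1_000);
        boolean stillInterrupted = Thread.interrupted(); // clear the flag before exiting
        // prints "attempts=1000 remaining=1 interrupted=true"
        System.out.println("attempts=" + attempts + " remaining=" + map.size()
                + " interrupted=" + stillInterrupted);
    }
}
```

Every attempt is used up and the map never shrinks, because each iteration restores the very condition that made the previous one fail.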
Proposed solutions
- Quick treatment with library gracefully executing to successful completion: Catch OOM explicitly when allocating buf in MappedByteBufferResources and spill to disk if that happened. This would be in line with how IOException from makeTempFile is caught.
- Lower memory usage (could be combined with previous): fastZipEntryToZipFileSliceMap knows the size of the nested Jar file (childZipEntry). If that size were passed down, the decision to spill to disk could be made immediately, before allocating and without reading the whole stream.
- Don’t spin: In any case, NestedJarHandler.close should probably be able to close canonicalFileToPhysicalZipFileMap even in an interrupted state, so that failed classpath scans can receive the exceptional termination and handle it accordingly.
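One way the last point could look is sketched below. This is not the fix that went into ClassGraph, only a minimal illustration of a close routine that tolerates interruption: it clears the interrupt flag for the duration of the cleanup and restores it afterwards. The class name, the closeAll method and the String-to-Closeable map are assumptions made up for this example.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

final class InterruptTolerantClose {
    /**
     * Closes and removes every entry, even if the calling thread is interrupted.
     * Returns the number of resources that closed without an exception.
     */
    static int closeAll(Map<String, Closeable> resources) {
        boolean wasInterrupted = Thread.interrupted(); // remember and clear the flag
        int closed = 0;
        try {
            Iterator<Map.Entry<String, Closeable>> it = resources.entrySet().iterator();
            while (it.hasNext()) {
                Closeable resource = it.next().getValue();
                try {
                    resource.close();
                    closed++;
                } catch (IOException e) {
                    // Best effort: keep closing the remaining resources.
                }
                it.remove(); // always make progress, so the cleanup cannot loop forever
            }
        } finally {
            if (wasInterrupted) {
                Thread.currentThread().interrupt(); // restore the flag for the caller
            }
        }
        return closed;
    }

    public static void main(String[] args) {
        AtomicInteger closedCounter = new AtomicInteger();
        // Stand-in for a resource whose close() fails on an interrupted thread.
        Closeable picky = () -> {
            if (Thread.currentThread().isInterrupted()) {
                throw new IOException("interrupted");
            }
            closedCounter.incrementAndGet();
        };
        Map<String, Closeable> resources = new ConcurrentHashMap<>();
        resources.put("outer.jar", picky);
        resources.put("outer.jar!/lib/inner.jar", picky);

        Thread.currentThread().interrupt(); // simulate the failed scan
        int closed = closeAll(resources);
        boolean flagRestored = Thread.interrupted();
        // prints "closed=2 remaining=0 flagRestored=true"
        System.out.println("closed=" + closed + " remaining=" + resources.size()
                + " flagRestored=" + flagRestored);
    }
}
```

Removing each entry regardless of whether its close succeeded is a deliberate choice here: it trades a possibly leaked handle for a cleanup that is guaranteed to finish.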
Issue Analytics
- Created 4 years ago
- Reactions: 2
- Comments: 18 (18 by maintainers)
Top GitHub Comments
After working around #403 by using JDK 9 for runtime, I can confirm that a complex Neo4J OGM, Kotlin, Ktor app runs with Xmx64M happily, and Xmx32M still runs into OOM, but doesn’t spin.
You’re welcome! That’s the goal. Especially when somebody files a bug that is as detailed and deeply researched as yours is!
I usually try to ensure any bug that is reported is fixed within 24 hours, and a new release pushed out – but this particular bug report has triggered a whole set of domino changes that affect ClassGraph at a very deep level, and it might take a few days to get all the changes in place. I’ll try to get some more of that done tomorrow.
Right, but I thought you were saying -Xmx sets the maximum limit for all allocations, both heap and memory mapped files? Or do memory mappings fall outside the -Xmx limit (while still being limited, probably relative to RAM size)? (I guess I could just go test this, but it seems like you know the answer…)
Yes, this was a major oversight, that I forgot to trim that array! I think that alone is the source of the sometimes dramatic drop in memory consumption you saw in your latest comment.
I assumed that zipping a zipfile rarely if ever resulted in further compression, but I guess if a zipfile already contained stored uncompressed content, it should be compressible. Or maybe some zipfiles contain central directories that are large enough that zip thinks it is worth it to zip the file for the compression gains. However I just looked at my ~/.m2/repository cache, and there isn’t a single jar in there that contains a nested jar. So I think it is only web applications, Spring-Boot applications, etc. (with custom classloaders) that cause this issue to arise, and not libraries in general.
Well that’s even more amazing then that you put so much time and effort into tracking down the source of the problem, when it was a dependency of a dependency that was causing the problem!
I always thought so before, but I’m not so sure anymore. Actually the new RandomAccessFile method only ever relies on operating system buffers, which will only ever use unused RAM, so the memory overhead is basically zero.
Yes, and this is done lazily. But I didn’t know that OS file buffering worked so well with RandomAccessFile; I thought it would cause many additional disk reads, leading to greater overhead.
I should test this on at least Windows before I totally pull the plug on MappedByteBuffer, but to be honest the mmap support in ClassGraph has caused a neverending series of headaches over the years, e.g. because without using misc.Unsafe, you can’t forcibly unmap mapped byte buffers, you just have to wait for the GC to decide to collect them, then they’re freed. And if you don’t unmap them, Windows holds a file lock on them, which can cause files that were supposed to be deleted to be left behind. Anyway although ClassGraph seems to work flawlessly with mmap now across all three major operating systems (and it took a lot of work to get there), I won’t shed any tears ripping all of it out as long as the performance is not impacted.
Thanks, I really appreciate the contribution!