OutOfMemoryError spins library out of control

Issue summary

nonapi.io.github.classgraph.fastzipfilereader.NestedJarHandler#close may never complete if the thread was interrupted, which is always the case when scanning fails. See below for details.

Background

I was writing some unit tests for a really basic Neo4J graph, but while Neo4J-OGM was looking for my Entities during test setup, it just never completed. With some very roundabout debugging (all the async and OOM interplay was making me tear my hair out), I managed to narrow down the root cause to how classgraph works. I have 32G memory, and Java defaults to using 1/4th of memory per process, which means I cannot run much in parallel, so I have a global 256M override for each process. It works fine for Gradle, Neo4J, and other traditionally resource-hungry setups. It even works for my test setup most of the time, so for this report repro I lowered the memory limit to be consistent.

Repro Setup

trivial library usage:

	// implementation "io.github.classgraph:classgraph:4.8.62"
	ScanResult result = new ClassGraph()
			.whitelistPackages("pack.age")
			.verbose()
			.scan()
			;

plus a dependency with a nested jar in it:

runtimeOnly "org.neo4j:neo4j-lucene-upgrade:3.4.9"

non-trivial: resource constrained memory: -Xmx128M (see build.gradle)

Full minimal repro can be found here: https://github.com/TWiStErRob/repros/tree/master/classgraph/oom-NestedJarHandler-spin
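
Because the repro depends on the heap cap actually reaching the forked JVM, it can help to confirm the limit from inside the process. A minimal sketch using only the standard Runtime API; the class name HeapLimitCheck is made up for illustration and is not part of the linked repro:

```java
class HeapLimitCheck {
    /** Maximum heap the JVM will attempt to use, in MiB. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    public static void main(String[] args) {
        // With -Xmx128M this is expected to report roughly 128; the exact
        // number can differ slightly depending on the garbage collector.
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

If the printed value is much larger than 128, the VM option was probably not applied to the process that runs the scan.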

Repro steps

  1. gradlew run

or

  1. Import project into IntelliJ
  2. Execute Main.main (should pass, depending on machine setup)
  3. Edit Run Configuration to have VM options: -Xmx128M
  4. Run or debug again (will fail, expected: print a line)

gradlew run.log

Pieces of the puzzle

What I found to be contributing to the issue.

Nested Jar files

runtimeOnly "org.neo4j:neo4j-lucene-upgrade:3.4.9"

An example from Neo4J/OGM, which contains !/lib/lucene-backward-codecs-5.5.0.jar. This is required to trigger the

                    } else {
                        // This path has one or more '!' sections.

code path in NestedJarHandler.

Large allocation to trigger OOM

This is baked into the library. MappedByteBufferResources allocates 64M per JAR file, which is quite large: there could be many JARs like this on the classpath, so a lot of allocations could happen at the same time and consume a lot of memory. Note: I have a 14-core/28-logical-core processor, and the library runs on 39 threads. So if all of those are processing nested JARs, about 2.5G of memory is required. That doesn’t sound “ultra-lightweight” 😃
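
The worst-case figure can be checked with a few lines of arithmetic. This is only a sketch of the estimate: it assumes one 64M buffer per worker thread held at the same time, and the class and method names are invented for illustration.

```java
import java.util.Locale;

class WorstCaseBufferMemory {
    // 64M per nested JAR, the buffer size reported in this issue.
    static final long BUFFER_BYTES = 64L * 1024 * 1024;

    /** Heap needed if every worker thread holds one buffer at the same time. */
    static long worstCaseBytes(int workerThreads) {
        return workerThreads * BUFFER_BYTES;
    }

    public static void main(String[] args) {
        long bytes = worstCaseBytes(39); // 39 threads, as observed on the reporter's machine
        System.out.println(String.format(Locale.ROOT, "%d MiB = %.2f GiB",
                bytes / (1024 * 1024), bytes / (1024.0 * 1024 * 1024)));
        // prints "2496 MiB = 2.44 GiB"
    }
}
```

That is roughly the 2.5G quoted above, and about twenty times the 128M heap used in the repro.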

This allocation throws an OOM, which is caught by AutoCloseableExecutorService.afterExecute and the thread is interrupted. I think this interrupt doesn’t hurt NestedJarHandler.close, so it could be an irrelevant red herring to the issue. But this location came up during the investigation so I thought I would mention it.

Scanner.call catches and cleans up

When an exception occurs (e.g. our OOM), removeTemporaryFilesAfterScan is set to true and threads are interrupted. The finally block then tries to close nestedJarHandler. The relevant parts of the method:

    @Override
    public ScanResult call() throws InterruptedException, CancellationException, ExecutionException {
        try {
            scanResult = openClasspathElementsThenScan();
        } catch (final Throwable e) {
            // Since an exception was thrown, remove temporary files
            removeTemporaryFilesAfterScan = true;

            // Stop any running threads (should not be needed, threads should already be quiescent)
            interruptionChecker.interrupt();
        } finally {
            if (removeTemporaryFilesAfterScan) {
                // If removeTemporaryFilesAfterScan was set, remove temp files and close resources,
                // zipfiles and modules
                nestedJarHandler.close(topLevelLog);
            }
        }
        return scanResult;
    }

NestedJarHandler#close spins

See numbered inline comments for explanation.

            if (canonicalFileToPhysicalZipFileMap != null) {
                // (4) spins out of control, because (3)
                while (!canonicalFileToPhysicalZipFileMap.isEmpty()) {
                    try {
                        // (1) throws InterruptedException if thread was interrupted, which is always because Scanner error handling interrupted
                        for (final Entry<File, PhysicalZipFile> ent : canonicalFileToPhysicalZipFileMap.entries()) {
                            final PhysicalZipFile physicalZipFile = ent.getValue();
                            physicalZipFile.close();
                            // (3) never executed, because of ConcurrentHashMap's interruption in SingletonMap
                            canonicalFileToPhysicalZipFileMap.remove(ent.getKey());
                        }
                    } catch (final InterruptedException e) {
                        // (2) re-interrupts Thread
                        interruptionChecker.interrupt();
                    }
                }
                canonicalFileToPhysicalZipFileMap = null;
            }
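
To make the mechanism easier to follow outside ClassGraph, here is a simplified model of the same loop shape. It is a sketch, not the library's code: the keys helper is a hypothetical stand-in for a map accessor that throws InterruptedException on an interrupted thread, and the loop is capped at maxAttempts so that the demo terminates.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class CloseSpinModel {
    /**
     * Hypothetical stand-in for a map accessor that refuses to run on an
     * interrupted thread. Like most blocking JDK methods, it clears the
     * interrupt flag before throwing.
     */
    static Iterable<String> keys(Map<String, String> map) throws InterruptedException {
        if (Thread.interrupted()) {
            throw new InterruptedException();
        }
        return map.keySet();
    }

    /** Same shape as the loop above, but capped so that the demo terminates. */
    static int attemptsUsed(Map<String, String> map, int maxAttempts) {
        int attempts = 0;
        while (!map.isEmpty() && attempts < maxAttempts) {
            attempts++;
            try {
                for (String key : keys(map)) {      // (1) throws: the flag is set
                    map.remove(key);                // (3) never reached
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // (2) sets the flag again
            }
        }
        return attempts;
    }

    public static void main(String[] args) {
        Map<String, String> map = new ConcurrentHashMap<>();
        map.put("outer.jar", "handle");
        Thread.currentThread().interrupt(); // models the scanner's error handling
        int attempts = attemptsUsed(map, 1_000);
        Thread.interrupted();               // clear the flag before exiting
        System.out.println("attempts=" + attempts + ", entries left=" + map.size());
        // prints "attempts=1000, entries left=1"
    }
}
```

Every pass uses up one attempt without removing anything, so without the cap the loop never exits. On a thread that is not interrupted, the same method empties the map in a single pass.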

Proposed solutions

  1. Quick treatment with library gracefully executing to successful completion:
    Catch OOM explicitly when allocating buf in MappedByteBufferResources and spill to disk if that happened. This would be in line with how IOException from makeTempFile is caught.
  2. Lower memory usage (could be combined with previous):
    fastZipEntryToZipFileSliceMap knows the size of the nested JAR file (childZipEntry). If that size was passed down, the decision to spill to disk could be made immediately, before allocating and without reading the whole stream.
  3. Don’t spin
    In any case, NestedJarHandler.close should probably be able to close canonicalFileToPhysicalZipFileMap even in an interrupted state, so that failed classpath scans can receive the exceptional termination and handle it accordingly.
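
The three proposals can be sketched together in plain Java. None of this is ClassGraph code: the class name, the method names and the 64M threshold are assumptions chosen for illustration, and the map of AutoCloseable values only stands in for canonicalFileToPhysicalZipFileMap.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class ProposedFixesSketch {
    /** Hypothetical threshold above which a nested JAR goes to a temp file. */
    static final long MAX_IN_MEMORY_BYTES = 64L * 1024 * 1024;

    /** Proposal 2: decide from the known entry size, before allocating anything. */
    static boolean shouldSpillToDisk(long knownSizeBytes) {
        return knownSizeBytes > MAX_IN_MEMORY_BYTES;
    }

    /** Proposal 1: treat a failed allocation as a signal to spill, not as fatal. */
    static byte[] tryAllocate(int size) {
        try {
            return new byte[size];
        } catch (OutOfMemoryError e) {
            return null; // the caller would fall back to a temp file here
        }
    }

    /** Proposal 3: close everything even if the calling thread is interrupted. */
    static int closeAll(Map<String, AutoCloseable> resources) {
        boolean wasInterrupted = Thread.interrupted(); // clears the flag
        int closed = 0;
        try {
            Iterator<Map.Entry<String, AutoCloseable>> it = resources.entrySet().iterator();
            while (it.hasNext()) {
                AutoCloseable resource = it.next().getValue();
                it.remove(); // remove first so each entry is visited at most once
                try {
                    resource.close();
                    closed++;
                } catch (Exception e) {
                    // Best effort: keep closing the remaining entries.
                }
            }
        } finally {
            if (wasInterrupted) {
                Thread.currentThread().interrupt(); // restore the flag for the caller
            }
        }
        return closed;
    }

    public static void main(String[] args) {
        Map<String, AutoCloseable> resources = new ConcurrentHashMap<>();
        resources.put("outer.jar", () -> System.out.println("closed outer.jar"));
        Thread.currentThread().interrupt();
        System.out.println("closed " + closeAll(resources) + " resource(s)");
        System.out.println("spill a 1k entry? " + shouldSpillToDisk(1024));
        Thread.interrupted(); // clear the flag before exiting
    }
}
```

The key design choice in closeAll is that the interrupt flag is cleared for the duration of the cleanup and restored afterwards, so the caller still sees the interruption but the loop always terminates.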

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Reactions: 2
  • Comments:18 (18 by maintainers)

Top GitHub Comments

1 reaction
TWiStErRob commented, Feb 16, 2020

After working around #403 by using JDK 9 for runtime, I can confirm that a complex Neo4J OGM, Kotlin, Ktor app runs with Xmx64M happily, and Xmx32M still runs into OOM, but doesn’t spin.

1 reaction
lukehutch commented, Feb 10, 2020

> Wow, lot of info 😃 And also I’m not used to getting fixes (let alone replies) this fast. Well done!

You’re welcome! That’s the goal. Especially when somebody files a bug that is as detailed and deeply researched as yours is!

I usually try to ensure any bug that is reported is fixed within 24 hours, and a new release pushed out – but this particular bug report has triggered a whole set of domino changes that affect ClassGraph at a very deep level, and it might take a few days to get all the changes in place. I’ll try to get some more of that done tomorrow.

>> I hadn’t considered though that Java limits not just the resident size but also the entire accessible virtual memory size

> Yes, Java OOM means there’s no more JVM heap available. JVMs always have an upper limit for memory (Runtime.getRuntime().maxMemory()), you cannot new things more than that. All that virtual memory and paging only applies to native allocations. If anyone, lib or app, does a new during runtime that’ll chip away from the Xmx.

Right, but I thought you were saying -Xmx sets the maximum limit for all allocations, both heap and memory mapped files? Or do memory mappings fall outside the -Xmx limit (while still being limited, probably relative to RAM size)? (I guess I could just go test this, but it seems like you know the answer…)

> I wholeheartedly support this, but for a 1k nested JAR there is no need to allocate and consume 64MB of Java heap. Which you fixed in 037983e.

Yes, this was a major oversight, that I forgot to trim that array! I think that alone is the source of the sometimes dramatic drop in memory consumption you saw in your latest comment.

> Yep, I checked inside neo4j-lucene-upgrade-3.4.9.jar, it has 3 jar files inside its lib folder. ZIP-ing a ZIP always gives a bit of extra gain. In this case it’s 9% (500k) deflated vs. stored version. In principle I agree, JARs should always store inner JARs to be able to use optimizations for classpath like you did.

I assumed that zipping a zipfile rarely if ever resulted in further compression, but I guess if a zipfile already contained stored uncompressed content, it should be compressible. Or maybe some zipfiles contain central directories that are large enough that zip thinks it is worth it to zip the file for the compression gains. However I just looked at my ~/.m2/repository cache, and there isn’t a single jar in there that contains a nested jar. So I think it is only web applications, Spring-Boot applications, etc. (with custom classloaders) that cause this issue to arise, and not libraries in general.

> Note: This alone wouldn’t have helped, because I’m not using this library, Neo4J OGM does and I don’t have access to ClassGraph builder. Luckily using the hint will help a lot in this case.

Well that’s even more amazing then that you put so much time and effort into tracking down the source of the problem, when it was a dependency of a dependency that was causing the problem!

> Memory mapping sounds like the optimal way to go, as it’s handled natively, doesn’t copy data (saves JVM heap and native memory).

I always thought so before, but I’m not so sure anymore. Actually the new RandomAccessFile method only ever relies on operating system buffers, which will only ever use unused RAM, so the memory overhead is basically zero.

> I guess in the end memory mapping still has to read from disk when the memory is accessed and because I/O is magnitude slower than RAM you won’t see a difference in processing speed. The win is the change in amount and location of allocated memory.

Yes, and this is done lazily. But I didn’t know that OS file buffering worked so well with RandomAccessFile – I thought it would cause many additional disk reads, leading to greater overhead.

I should test this on at least Windows before I totally pull the plug on MappedByteBuffer, but to be honest the mmap support in ClassGraph has caused a neverending series of headaches over the years, e.g. because without using misc.Unsafe, you can’t forcibly unmap mapped byte buffers, you just have to wait for the GC to decide to collect them, then they’re freed – and if you don’t unmap them, Windows holds a file lock on them, which can cause files that were supposed to be deleted to be left behind. Anyway although ClassGraph seems to work flawlessly with mmap now across all three major operating systems (and it took a lot of work to get there), I won’t shed any tears ripping all of it out as long as the performance is not impacted.

> The last two are artificial files with 128 nested JAR files, that I can’t attach here 😦 I know it’s unrealistic, but it proves your fix working, and it is working really well! See also PR #401.

Thanks, I really appreciate the contribution!
