
memory leak and duration increase during training

See original GitHub issue

Description

During training on the GPU I experience:

  • a 4 MiB GPU memory leak per epoch (appears constant)
  • a duration increase of about 1 min per epoch (appears linear)

Expected Behavior

No memory leak and a roughly constant duration per epoch.

How to Reproduce?

I set up a toy app based on the DJL MNIST example to reproduce the problem I experience:

git clone https://github.com/enpasos/reproducebug1.git
cd reproducebug1
gradlew build
java -jar app/build/libs/app-0.0.1-SNAPSHOT.jar

Environment Info

  • GPU: NVIDIA GeForce RTX 3090
  • CPU: AMD Ryzen 9 3950X 16-Core Processor
  • RAM: 64 GB
  • OS: Windows 11 Pro, Version 22H2, OS build 22623.1020
  • GPU Driver: 522.25
  • CUDA SDK: 11.6.2
  • CUDNN: cudnn-windows-x86_64-8.5.0.96_cuda11
  • Java: Corretto-17.0.3.6.1
  • DJL: 0.21.0-SNAPSHOT (05.12.2022)
  • PYTORCH: 1.12.1

Issue Analytics

  • State: open
  • Created: 10 months ago
  • Comments: 11 (11 by maintainers)

Top GitHub Comments

1 reaction
lanking520 commented, Dec 9, 2022

Could you please help here, @KexinFeng?

0 reactions
enpasos commented, Dec 14, 2022

@KexinFeng @lanking520 what do you think? … Wouldn’t it be nice if the user didn’t have to take care of the NDManager hierarchy subtleties described above? … For me the best experience would be if it simply worked like garbage collection: if you no longer need a resource - when you no longer reference it - it should be closed automatically.
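As an aside, the JDK itself offers this "close when unreachable" pattern via java.lang.ref.Cleaner (Java 9+). A minimal sketch, with a hypothetical NativeBuffer class standing in for a native resource such as an NDArray (not actual DJL API):

```java
import java.lang.ref.Cleaner;

// Hypothetical stand-in for a resource backed by native memory.
class NativeBuffer implements AutoCloseable {
    static final Cleaner CLEANER = Cleaner.create();
    static volatile boolean freed = false; // observable for demonstration only

    private final Cleaner.Cleanable cleanable;

    NativeBuffer() {
        // The cleanup action must not capture `this`, otherwise the object
        // stays reachable and is never collected. Here it touches only a
        // static field, simulating "free the native memory".
        cleanable = CLEANER.register(this, () -> freed = true);
    }

    @Override
    public void close() {
        // Explicit close still works; GC-triggered cleanup is only a safety net.
        cleanable.clean();
    }
}
```

The cleanup action runs at most once, either on explicit close() or after the JVM garbage collector finds the buffer unreachable.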

Fortunately, there is a way to implement this with relatively little effort compared to the user-experience gain, and without disturbing the in most cases wonderfully running DJL internals. One could simply hook into standard JVM garbage collection. How?

  • Instead of using resources like NDArray directly, wrap each one in a dynamic proxy. Only the user holds a reference to this proxy; to the user it should feel like a normal NDArray. Once no references to the proxy remain, it becomes eligible for JVM garbage collection.
  • On creation, the actual NDArray, which is still referenced by its NDManager, could be put into a WeakHashMap.
  • The dynamic proxy holds only a UUID key, with which it retrieves the NDArray from the WeakHashMap on every method call.
  • To avoid touching the GPL-licensed WeakHashMap.java code, yet still get a “callback” from garbage collection when the dynamic proxy and the UUID key are deleted, use a separate ReferenceQueue together with a WeakReference whose referent is the UUID key and which carries the NDArray as an object property. When no user references to the dynamic proxy remain, the proxy and the UUID key are garbage collected, and the JVM garbage collector then adds the WeakReference to the ReferenceQueue.
  • On every access to the WeakHashMap (or a wrapper around it), the ReferenceQueue can be polled for WeakReferences whose UUID keys have been collected; the corresponding resource, e.g. the NDArray carried by each WeakReference, can then be closed.
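The steps above can be sketched with plain JDK classes. This is a simplified illustration, not the proof-of-concept code: a hypothetical Resource class stands in for NDArray, and the WeakReference tracks the user-facing handle directly rather than a UUID key in a WeakHashMap, which keeps the sketch short while preserving the ReferenceQueue mechanism:

```java
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a native resource such as an NDArray.
class Resource {
    volatile boolean closed = false;
    void close() { closed = true; }
}

// WeakReference to the user-facing handle (the dynamic proxy in the proposal),
// carrying the underlying resource so it can be closed once the handle is gone.
class HandleRef extends WeakReference<Object> {
    final Resource resource;
    HandleRef(Object handle, Resource resource, ReferenceQueue<Object> queue) {
        super(handle, queue);
        this.resource = resource;
    }
}

class AutoCloser {
    private final ReferenceQueue<Object> queue = new ReferenceQueue<>();
    private final Map<UUID, HandleRef> live = new ConcurrentHashMap<>();

    // Register a user-facing handle together with its backing resource.
    UUID register(Object handle, Resource resource) {
        UUID id = UUID.randomUUID();
        live.put(id, new HandleRef(handle, resource, queue));
        return id;
    }

    // Poll the queue on every access: the GC has enqueued the WeakReferences
    // of all handles that became unreachable; close their resources now.
    int drain() {
        int closedCount = 0;
        HandleRef ref;
        while ((ref = (HandleRef) queue.poll()) != null) {
            ref.resource.close();
            live.values().remove(ref);
            closedCount++;
        }
        return closedCount;
    }
}
```

Note that drain() runs on the caller's thread, so closing happens lazily at the next access rather than at the moment of collection, which matches the “poll on any call to the map” step above.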

In the last few hours I have put together a small proof of concept. It works like a charm with very little code.

git clone https://github.com/enpasos/poc1.git
cd poc1
gradlew build
java -jar app/build/libs/app-0.0.1-SNAPSHOT.jar

If you like the approach, I could set up an implementation in DJL as a pull request - maybe with your help - with a switch to enable or disable the feature, to be on the safe side.


Top Results From Across the Web

Dealing with memory leak issue in Keras model training
Recently, I was trying to train my keras (v2.4.3) model with tensorflow-gpu (v2.2.0) backend on NVIDIA's Tesla V100-DGXS-32GB.

Potential memory leak when training? · Issue #3756 - GitHub
The memory that the trainer object holds on to on the heap does increase with each training iteration. It may be possible to...

Memory Leak Detection Algorithms in the Cloud-based ... - arXiv
Abstract—A memory leak in an application deployed on the cloud can affect the availability and reliability of the application.

What is Memory Leak? How can we avoid? - GeeksforGeeks
The consequence of a memory leak is that it reduces the performance of the computer by reducing the amount of available memory.

Possible memory leak while training deep neural networks
I am training a deep neural network (a CNN) using out-of-memory data. As usual, I am creating an "imageDatastore" and passing it to...
