
memory leak and duration increase during training

See original GitHub issue

Description

During training on the GPU I experience:

  • a 4 MiB GPU memory leak per epoch (appears constant)
  • a duration increase of about 1 min per epoch (appears linear)

Expected Behavior

No memory leak and a roughly constant duration per epoch.

How to Reproduce?

I set up a toy app based on the DJL MNIST example to reproduce the problem I experience:

git clone https://github.com/enpasos/reproducebug1.git
cd reproducebug1
gradlew build
java -jar app/build/libs/app-0.0.1-SNAPSHOT.jar

Environment Info

  • GPU: NVIDIA GeForce RTX 3090
  • CPU: AMD Ryzen 9 3950X 16-Core Processor
  • RAM: 64 GB
  • OS: Windows 11 Pro, Version 22H2, OS build 22623.1020
  • GPU Driver: 522.25
  • CUDA SDK: 11.6.2
  • CUDNN: cudnn-windows-x86_64-8.5.0.96_cuda11
  • Java: Corretto-17.0.3.6.1
  • DJL: 0.21.0-SNAPSHOT (05.12.2022)
  • PYTORCH: 1.12.1

Issue Analytics

  • State: open
  • Created: 10 months ago
  • Comments: 11 (11 by maintainers)

Top GitHub Comments

1 reaction
lanking520 commented, Dec 9, 2022

Could you please help here, @KexinFeng?

0 reactions
enpasos commented, Dec 14, 2022

@KexinFeng @lanking520 what do you think? … Wouldn’t it be nice if the user didn’t have to take care of the NDManager hierarchy subtleties described above? … For me the best experience would be if it simply worked like garbage collection: if you no longer need a resource - when you no longer reference it - it should be closed automatically.
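As an aside, the JDK itself offers this "close when unreachable" pattern via java.lang.ref.Cleaner (Java 9+). A minimal sketch, with a hypothetical NativeBuffer class standing in for a native resource such as an NDArray (not actual DJL API):

```java
import java.lang.ref.Cleaner;

// Hypothetical stand-in for a resource backed by native memory.
class NativeBuffer implements AutoCloseable {
    static final Cleaner CLEANER = Cleaner.create();
    static volatile boolean freed = false; // observable for demonstration only

    private final Cleaner.Cleanable cleanable;

    NativeBuffer() {
        // The cleanup action must not capture `this`, otherwise the object
        // stays reachable and is never collected. Here it touches only a
        // static field, simulating "free the native memory".
        cleanable = CLEANER.register(this, () -> freed = true);
    }

    @Override
    public void close() {
        // Explicit close still works; GC-triggered cleanup is only a safety net.
        cleanable.clean();
    }
}
```

The cleanup action runs at most once, either on explicit close() or after the JVM garbage collector finds the buffer unreachable.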

Fortunately, there is a way to implement this with relatively little effort compared to the user-experience gain, and without disturbing the in most cases wonderfully running DJL internals. One could simply hook into standard JVM garbage collection. How?

  • Instead of using resources like NDArray directly, wrap each one in a dynamic proxy. Only the user holds a reference to this proxy; to the user it should feel like a normal NDArray. Once no references to the proxy remain, it becomes eligible for JVM garbage collection.
  • On creation, the actual NDArray, which is still referenced by its NDManager, could be put into a WeakHashMap.
  • The dynamic proxy holds only a UUID key, with which it retrieves the NDArray from the WeakHashMap on every method call.
  • To avoid touching the GPL-licensed WeakHashMap.java code, yet still get a “callback” from garbage collection when the dynamic proxy and the UUID key are deleted, use a separate ReferenceQueue together with a WeakReference whose referent is the UUID key and which carries the NDArray as an object property. When no user references to the dynamic proxy remain, the proxy and the UUID key are garbage collected, and the JVM garbage collector then adds the WeakReference to the ReferenceQueue.
  • On every access to the WeakHashMap (or a wrapper around it), the ReferenceQueue can be polled for WeakReferences whose UUID keys have been collected; the corresponding resource, e.g. the NDArray carried by each WeakReference, can then be closed.
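The steps above can be sketched with plain JDK classes. This is a simplified illustration, not the proof-of-concept code: a hypothetical Resource class stands in for NDArray, and the WeakReference tracks the user-facing handle directly rather than a UUID key in a WeakHashMap, which keeps the sketch short while preserving the ReferenceQueue mechanism:

```java
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a native resource such as an NDArray.
class Resource {
    volatile boolean closed = false;
    void close() { closed = true; }
}

// WeakReference to the user-facing handle (the dynamic proxy in the proposal),
// carrying the underlying resource so it can be closed once the handle is gone.
class HandleRef extends WeakReference<Object> {
    final Resource resource;
    HandleRef(Object handle, Resource resource, ReferenceQueue<Object> queue) {
        super(handle, queue);
        this.resource = resource;
    }
}

class AutoCloser {
    private final ReferenceQueue<Object> queue = new ReferenceQueue<>();
    private final Map<UUID, HandleRef> live = new ConcurrentHashMap<>();

    // Register a user-facing handle together with its backing resource.
    UUID register(Object handle, Resource resource) {
        UUID id = UUID.randomUUID();
        live.put(id, new HandleRef(handle, resource, queue));
        return id;
    }

    // Poll the queue on every access: the GC has enqueued the WeakReferences
    // of all handles that became unreachable; close their resources now.
    int drain() {
        int closedCount = 0;
        HandleRef ref;
        while ((ref = (HandleRef) queue.poll()) != null) {
            ref.resource.close();
            live.values().remove(ref);
            closedCount++;
        }
        return closedCount;
    }
}
```

Note that drain() runs on the caller's thread, so closing happens lazily at the next access rather than at the moment of collection, which matches the “poll on any call to the map” step above.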

In the last few hours I have put together a small proof of concept. It works like a charm with very little code.

git clone https://github.com/enpasos/poc1.git
cd poc1
gradlew build
java -jar app/build/libs/app-0.0.1-SNAPSHOT.jar

If you like the approach, I could set up an implementation in DJL as a pull request - maybe with your help - with a switch to enable or disable the feature, to be on the safe side.


Top Results From Across the Web

Dealing with memory leak issue in Keras model training
Recently, I was trying to train my keras (v2.4.3) model with tensorflow-gpu (v2.2.0) backend on NVIDIA's Tesla V100-DGXS-32GB.

Potential memory leak when training? · Issue #3756 - GitHub
The memory that the trainer object holds on to on the heap does increase with each training iteration. It may be possible to...

Memory Leak Detection Algorithms in the Cloud-based ... - arXiv
Abstract—A memory leak in an application deployed on the cloud can affect the availability and reliability of the application.

What is Memory Leak? How can we avoid? - GeeksforGeeks
The consequence of a memory leak is that it reduces the performance of the computer by reducing the amount of available memory.

Possible memory leak while training deep neural networks
I am training a deep neural network (a CNN) using out-of-memory data. As usual, I am creating an "imageDatastore" and passing it to...
