Memory leak and duration increase during training
Description
During training on the GPU I observe:
- a GPU memory leak of about 4 MiB per epoch (appears constant per epoch)
- an increase in epoch duration of about 1 min per epoch (appears linear)
Expected Behavior
No memory leak and a roughly constant duration per epoch.
How to Reproduce?
I set up a toy app based on the DJL MNIST example to reproduce the problem I am experiencing:
git clone https://github.com/enpasos/reproducebug1.git
cd reproducebug1
gradlew build
java -jar app/build/libs/app-0.0.1-SNAPSHOT.jar
Environment Info
- GPU: NVIDIA GeForce RTX 3090
- CPU: AMD Ryzen 9 3950X 16-Core Processor
- RAM: 64 GB
- OS: Windows 11 Pro, Version 22H2, OS build 22623.1020
- GPU Driver: 522.25
- CUDA SDK: 11.6.2
- CUDNN: cudnn-windows-x86_64-8.5.0.96_cuda11
- Java: Corretto-17.0.3.6.1
- DJL: 0.21.0-SNAPSHOT (05.12.2022)
- PYTORCH: 1.12.1
Could you please help here, @KexinFeng?
@KexinFeng @lanking520 What do you think? Wouldn't it be nicer for users if they didn't have to worry about the NDManager hierarchy subtleties described above? For me the best experience would be if it simply worked like garbage collection: once you no longer reference a resource, it should be closed automatically.
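For context, this is roughly the explicit lifecycle management the standard DJL NDManager API requires today (a minimal sketch; the shapes and values are made up):

import ai.djl.ndarray.NDArray;
import ai.djl.ndarray.NDManager;
import ai.djl.ndarray.types.Shape;

NDManager baseManager = NDManager.newBaseManager();
try (NDManager sub = baseManager.newSubManager()) {
    NDArray x = sub.ones(new Shape(2, 2)); // attached to sub
    NDArray y = x.mul(2);                  // result is attached to the same manager
    y.attach(baseManager);                 // must be re-attached to outlive the try block
} // x and anything else still attached to sub are closed here; y survives

Forgetting such an attach call either leaks the array or closes it too early, and that is exactly the subtlety users should not have to think about.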
Fortunately, there is a way to implement this with relatively little effort compared to the gain in user experience, and without disturbing the parts of DJL that already run wonderfully in most cases: one can simply hook into standard JVM garbage collection. How?
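One way to do this (a hypothetical sketch under my own naming, not necessarily how the proof of concept is structured) is to register each native resource with java.lang.ref.Cleaner, so that its native memory is released once the wrapper object becomes unreachable:

import java.lang.ref.Cleaner;
import ai.djl.ndarray.NDArray;

public final class GcManagedNDArray implements AutoCloseable {
    private static final Cleaner CLEANER = Cleaner.create();

    private final NDArray delegate;
    private final Cleaner.Cleanable cleanable;

    public GcManagedNDArray(NDArray delegate) {
        this.delegate = delegate;
        // The cleanup action must not capture 'this', only the delegate,
        // otherwise the wrapper could never become unreachable.
        NDArray d = delegate;
        this.cleanable = CLEANER.register(this, d::close);
    }

    public NDArray get() {
        return delegate;
    }

    @Override
    public void close() {
        // clean() runs the action at most once, so explicit and
        // GC-triggered cleanup cannot double-free the native memory.
        cleanable.clean();
    }
}

One caveat worth keeping in mind: the JVM's garbage collector knows nothing about native or GPU memory pressure, so cleanup timing is not guaranteed; an explicit close() path should therefore remain available.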
Over the last few hours I have created a small proof of concept along these lines. It works like a charm with very little code.
If you like the approach, I could set up an implementation in DJL as a pull request, maybe with your help. To be on the safe side, the feature could be guarded by a switch so that users can enable it or not use it at all.