OOM at 48 GB of GPU memory during GPT-2 inference due to a memory leak in DirectML.
First of all, thank you for all your work; it's very exciting to see the Windows/AMD ML gap being closed! The issue is:
git clone https://github.com/openai/gpt-2.git
cd gpt-2
python -m pip install -r requirements.txt
python3 download_model.py 1558M
python src/interactive_conditional_samples.py 1558M
Given any text for inference, it repeatedly consumes all 48 GB of GPU memory (AMD Radeon VII 16 GB + 32 GB shared memory) and fails with:
(0) Resource exhausted: OOM when allocating tensor with shape[1,48,2,25,455,64] and type float on /job:localhost/replica:0/task:0/device:DML:0 by allocator DmlAllocator
[[node sample_sequence/while/concat (defined at F:\DSML\Soft\Anaconda\envs\directml\lib\site-packages\tensorflow_core\python\framework\ops.py:1762) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
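Following the hint above, the allocation report can be enabled by passing a RunOptions object to the sess.run call in src/interactive_conditional_samples.py. A minimal sketch, assuming the stock TF 1.15 API that tensorflow-directml is based on (the exact call site shown is illustrative):

import tensorflow as tf

# Ask TF to dump the list of live tensor allocations when an OOM is hit
run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)

# e.g. on the sampling run in interactive_conditional_samples.py:
out = sess.run(output, feed_dict={context: [context_tokens]}, options=run_options)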
Full log and environment are in the comments below. This is probably related to the resource-release issue seen with:
from ai_benchmark import AIBenchmark
# Run the AI Benchmark suite on the default (DML) device
benchmark = AIBenchmark(use_CPU=None, verbose_level=1)
results = benchmark.run()
which also fails with OOM during execution. Memory for tensors is not released between runs, or even after sess.close().
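For reference, the clean-up pattern that was tried looks roughly like the sketch below (identifiers are illustrative, standard TF 1.x API); even with the session closed and the default graph reset, the DML allocator does not appear to return the memory:

import numpy as np
import tensorflow as tf

def run_once():
    # Build a throw-away graph, run some work on the default (DML:0) device,
    # and tear everything down before returning.
    graph = tf.Graph()
    with graph.as_default():
        x = tf.constant(np.random.rand(1024, 1024), dtype=tf.float32)
        y = tf.matmul(x, x)
        with tf.Session(graph=graph) as sess:  # sess.close() on exit
            sess.run(y)
    tf.reset_default_graph()

for _ in range(10):
    run_once()  # observed here: GPU memory keeps growing across iterations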
Top GitHub Comments
@Herobring , we just released tensorflow-directml 1.15.3.dev200911 with many improvements to the memory allocator. You can try it out and tell us how it goes!
Also, since we have now open-sourced our fork, new tensorflow-directml issues should be opened over here.
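To try the new build, something like the following should work (a sketch; the version string is taken from the comment above):

python -m pip install --upgrade tensorflow-directml==1.15.3.dev200911

Then, to confirm the DML device is still visible (device_lib is the standard TF 1.x helper for listing local devices):

import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.__version__)  # expect a 1.15.3 dev build
print([d.name for d in device_lib.list_local_devices()])  # expect a '/device:DML:0' entry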
Thank you for your interest in making tensorflow-directml better, @Herobring. This is an area of improvement we're actively looking at, and every data point is appreciated. Our end goal is to dramatically improve our memory usage patterns and bring them closer to what people expect from CUDA or ROCm devices, but we're not there yet, as we have been focusing more on operator coverage in the past.
I’ll make sure to update this issue with a comment once we have made progress on this!