[BUG] memory overhead issue with optimizer leading to OOM
Could it be that unfused_optimizer is not careful about how it allocates memory when this condition occurs:
[2021-11-08 18:30:01,688] [INFO] [unfused_optimizer.py:275:_update_scale] Grad overflow on iteration: 2983
[2021-11-08 18:30:01,688] [INFO] [unfused_optimizer.py:276:_update_scale] Reducing dynamic loss scale from 65536.0 to 32768.0
[2021-11-08 18:30:01,689] [INFO] [unfused_optimizer.py:199:step] [deepspeed] fp16 dynamic loss scale overflow! Skipping step. Attempted loss scale: 65536.0, reducing to 32768.0
it tries to allocate more memory - a whopping 1GB!
fp32_param.grad = fp16_param.grad.to(fp32_param.dtype)
RuntimeError: CUDA out of memory. Tried to allocate 1.00 GiB (GPU 1; 31.75 GiB total capacity; 24.28 GiB already allocated; 256.00 MiB free; 30.01 GiB reserved in total by PyTorch)
RuntimeError: CUDA out of memory. Tried to allocate 1.00 GiB (GPU 3; 31.75 GiB total capacity; 24.28 GiB already allocated; 276.00 MiB free; 30.01 GiB reserved in total by PyTorch)
I had been running at 31 out of 32GB of GPU memory used for many hours until the above occurred and pushed it into OOM.
Perhaps that logic needs to first free the memory it no longer needs before allocating new memory?
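For illustration only, here is a rough sketch of what "free or reuse before allocating" could look like around that line; the function and parameter names are made up and this is not the actual unfused_optimizer code:

```python
import torch

# Hypothetical sketch, not DeepSpeed's actual implementation: reuse the existing
# fp32 gradient buffer when possible, and drop the stale one before casting, so
# the fp16->fp32 conversion doesn't need a fresh 1GB allocation on top of the
# memory still held by the old gradient.
def copy_fp16_grad_to_fp32(fp16_param, fp32_param):
    if fp32_param.grad is not None and fp32_param.grad.shape == fp16_param.grad.shape:
        # In-place copy into the already-allocated fp32 buffer: no new allocation.
        fp32_param.grad.copy_(fp16_param.grad)
    else:
        # Release any stale fp32 grad first so its block returns to the caching
        # allocator and can be reused by the allocation below.
        fp32_param.grad = None
        fp32_param.grad = fp16_param.grad.to(fp32_param.dtype)
```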
For example, this problem happened on a single SLURM job after it had been running fine for 15h.
This is with ZeRO-1 and Megatron-DeepSpeed.
Thank you!
Issue Analytics
- Created: 2 years ago
- Comments: 7 (7 by maintainers)
Top GitHub Comments
AFAIK, pytorch doesn’t have GC. It’s all python’s work. Once python frees a variable, pytorch automatically makes the memory available to new allocations (but it keeps it in its cache).
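For example, a minimal demonstration of that caching behaviour, assuming a CUDA device is available (not related to the DeepSpeed code itself):

```python
import torch

# Allocate ~128MiB of fp32 on the GPU.
x = torch.empty(128 * 2**20 // 4, device="cuda")
print(torch.cuda.memory_allocated() // 2**20, "MiB allocated")
print(torch.cuda.memory_reserved() // 2**20, "MiB reserved")

# Dropping the last Python reference frees the tensor: memory_allocated()
# goes down, but memory_reserved() stays up because PyTorch keeps the block
# in its caching allocator for reuse by future allocations.
del x
print(torch.cuda.memory_allocated() // 2**20, "MiB allocated")
print(torch.cuda.memory_reserved() // 2**20, "MiB reserved")
```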
PyTorch can't be aware of how Python assigns to a variable. E.g. you can see here how even doing a+b+c creates a peak memory overhead: https://github.com/pytorch/pytorch/issues/27522#issuecomment-975041172
As I tried to briefly describe above, Python doesn't necessarily free a variable when it goes out of scope or even when it gets deleted explicitly, since other objects may still be referencing it. When the reference count drops to 0, the object and the memory it occupies are freed. The GC runs its collection cycle on its own schedule, at which point it identifies which objects can be freed and frees them. So at critical points one may have to call gc.collect() explicitly in addition to del. Though if the object is not referenced by any other object, del should be sufficient to free the memory.
In our situation of working with huge memory chunks, an explicit gc.collect call introduces no significant additional overhead.
There is an in-depth guide on Python GC if you're interested.
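To make the two points above concrete, a small self-contained illustration (CUDA device assumed, sizes arbitrary):

```python
import gc
import torch

MiB = 2**20
a = torch.ones(64 * MiB // 4, device="cuda")   # ~64MiB of fp32 each
b = torch.ones_like(a)
c = torch.ones_like(a)

# a + b + c first materializes a temporary for (a + b), so peak memory is
# briefly higher than what the final result d needs on its own.
torch.cuda.reset_peak_memory_stats()
d = a + b + c
print("peak during a+b+c:", torch.cuda.max_memory_allocated() // MiB, "MiB")

# Freeing relies on reference counting: once the last reference is gone, the
# memory goes straight back to PyTorch's caching allocator. gc.collect() only
# matters when reference cycles keep an object alive past its del.
del d
gc.collect()
print("allocated after del:", torch.cuda.memory_allocated() // MiB, "MiB")
```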
Sounds like this particular scenario might just be a corner case, but still needs to be confirmed. There might not be much we can do about it besides avoiding flying too close to the sun.