
[BUG] memory overhead issue with optimizer leading to OOM

Could it be that unfused_optimizer is not careful about how it allocates memory when this condition occurs:

[2021-11-08 18:30:01,688] [INFO] [unfused_optimizer.py:275:_update_scale] Grad overflow on iteration: 2983
[2021-11-08 18:30:01,688] [INFO] [unfused_optimizer.py:276:_update_scale] Reducing dynamic loss scale from 65536.0 to 32768.0
[2021-11-08 18:30:01,689] [INFO] [unfused_optimizer.py:199:step] [deepspeed] fp16 dynamic loss scale overflow! Skipping step. Attempted loss scale: 65536.0, reducing to 32768.0

it tries to allocate more memory - a whopping 1 GB!

    fp32_param.grad = fp16_param.grad.to(fp32_param.dtype)
RuntimeError: CUDA out of memory. Tried to allocate 1.00 GiB (GPU 1; 31.75 GiB total capacity; 24.28 GiB already allocated; 256.00 MiB free; 30.01 GiB reserved in total by PyTorch)
RuntimeError: CUDA out of memory. Tried to allocate 1.00 GiB (GPU 3; 31.75 GiB total capacity; 24.28 GiB already allocated; 276.00 MiB free; 30.01 GiB reserved in total by PyTorch)

I had been running at 31 of 32 GB of GPU memory used for many hours until the above occurred and threw the job into OOM.

Perhaps that logic should first free memory it no longer needs before allocating new memory?
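
For illustration, here is a minimal sketch of that idea (this is not DeepSpeed's actual fix; it reuses the names from the traceback above and assumes the previous fp32 gradient is stale at this point and that the fp16 gradient can be released after the cast):

    # Hypothetical ordering: drop stale references before the cast, so peak
    # memory holds one gradient copy instead of two.
    fp32_param.grad = None  # release the previous fp32 gradient first
    fp32_param.grad = fp16_param.grad.to(fp32_param.dtype)
    fp16_param.grad = None  # the fp16 copy is no longer needed after the cast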

E.g., this problem happened on a single SLURM job that had been running fine for 15 hours.

This is with ZeRO-1 with Megatron-Deepspeed.

Thank you!

@tjruwase

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

1 reaction
stas00 commented, Nov 25, 2021

AFAIK, PyTorch doesn't have GC. It's all Python's work. Once Python frees a variable, PyTorch automatically makes the memory available to new allocations (but it keeps it in its cache).
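
You can watch that caching behavior directly (a small sketch; the tensor size is arbitrary):

    import torch

    x = torch.empty(64 * 1024**2, device="cuda")  # ~256 MiB of fp32
    print(torch.cuda.memory_allocated())   # bytes held by live tensors
    print(torch.cuda.memory_reserved())    # bytes held by the caching allocator

    del x
    print(torch.cuda.memory_allocated())   # drops back down: the tensor is gone
    print(torch.cuda.memory_reserved())    # stays put: memory kept in PyTorch's cache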

PyTorch can't be aware of how Python assigns to a variable. E.g., you can see here how even doing a + b + c creates a peak memory overhead: https://github.com/pytorch/pytorch/issues/27522#issuecomment-975041172
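
A quick way to observe that peak yourself (a sketch with arbitrary tensor sizes, not the exact benchmark from that issue):

    import torch

    n = 64 * 1024**2  # 64M fp32 elements, ~256 MiB per tensor
    a = torch.ones(n, device="cuda")
    b = torch.ones(n, device="cuda")
    c = torch.ones(n, device="cuda")

    torch.cuda.reset_peak_memory_stats()
    d = a + b + c  # evaluated as (a + b) + c
    # Peak is ~2 tensors above the inputs: the (a + b) temporary plus the result.
    print(torch.cuda.max_memory_allocated())

    del d
    torch.cuda.reset_peak_memory_stats()
    d = a.clone()          # result buffer
    d.add_(b).add_(c)      # in-place accumulation: no hidden temporary
    # Peak is only ~1 tensor above the inputs.
    print(torch.cuda.max_memory_allocated())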

As I tried to briefly describe above, Python doesn't necessarily free a variable when it goes out of scope, or even when it gets deleted explicitly, since other objects may still reference it. When the reference count drops to 0, the object and the memory it occupies are freed. The GC runs its cycle on its own schedule, identifying which objects can be freed and freeing them. So at critical points one may have to call gc.collect() explicitly in addition to del. Though if the object is not referenced by any other object, del should be sufficient to free the memory.
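
At such a critical point the pattern looks like this (a sketch; the final empty_cache() is optional and only matters if something outside PyTorch needs the memory):

    import gc
    import torch

    big = torch.empty(256 * 1024**2, device="cuda")  # ~1 GiB of fp32

    del big                   # drop the last reference; memory returns to PyTorch's cache
    gc.collect()              # collect anything kept alive by reference cycles
    torch.cuda.empty_cache()  # optional: hand cached blocks back to the CUDA driver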

In our situation of working with huge memory chunks, an explicit gc.collect() call introduces no overhead of any significance.

There is an in-depth guide on Python GC if you're interested.

0 reactions
tjruwase commented, Nov 25, 2021

Sounds like this particular scenario might just be a corner case, but that still needs to be confirmed. There might not be much we can do about it besides avoiding flying too close to the sun.
