[BUG] memory overhead issue with optimizer leading to OOM
Could it be that unfused_optimizer is not careful about how it allocates memory when this condition occurs:
[2021-11-08 18:30:01,688] [INFO] [unfused_optimizer.py:275:_update_scale] Grad overflow on iteration: 2983
[2021-11-08 18:30:01,688] [INFO] [unfused_optimizer.py:276:_update_scale] Reducing dynamic loss scale from 65536.0 to 32768.0
[2021-11-08 18:30:01,689] [INFO] [unfused_optimizer.py:199:step] [deepspeed] fp16 dynamic loss scale overflow! Skipping step. Attempted loss scale: 65536.0, reducing to 32768.0
it tries to allocate more memory - a whopping 1GB!
fp32_param.grad = fp16_param.grad.to(fp32_param.dtype)
RuntimeError: CUDA out of memory. Tried to allocate 1.00 GiB (GPU 1; 31.75 GiB total capacity; 24.28 GiB already allocated; 256.00 MiB free; 30.01 GiB reserved in total by PyTorch)
RuntimeError: CUDA out of memory. Tried to allocate 1.00 GiB (GPU 3; 31.75 GiB total capacity; 24.28 GiB already allocated; 276.00 MiB free; 30.01 GiB reserved in total by PyTorch)
I had been running at 31 out of 32GB of GPU memory used for many hours until the above occurred and pushed it into OOM.
Perhaps that logic needs to first free the memory it no longer needs before allocating new memory?
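For illustration only, here is a rough sketch of what "free or reuse before allocating" could look like around that line; the function and parameter names are made up and this is not the actual unfused_optimizer code:

```python
import torch

# Hypothetical sketch, not DeepSpeed's actual implementation: reuse the existing
# fp32 gradient buffer when possible, and drop the stale one before casting, so
# the fp16->fp32 conversion doesn't need a fresh 1GB allocation on top of the
# memory still held by the old gradient.
def copy_fp16_grad_to_fp32(fp16_param, fp32_param):
    if fp32_param.grad is not None and fp32_param.grad.shape == fp16_param.grad.shape:
        # In-place copy into the already-allocated fp32 buffer: no new allocation.
        fp32_param.grad.copy_(fp16_param.grad)
    else:
        # Release any stale fp32 grad first so its block returns to the caching
        # allocator and can be reused by the allocation below.
        fp32_param.grad = None
        fp32_param.grad = fp16_param.grad.to(fp32_param.dtype)
```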
For example, this problem happened on a single SLURM job after it had been running fine for 15h.
This is with ZeRO-1 and Megatron-DeepSpeed.
Thank you!
Issue Analytics
- Created: 2 years ago
- Comments: 7 (7 by maintainers)
Top GitHub Comments
AFAIK, pytorch doesn’t have GC. It’s all python’s work. Once python frees a variable, pytorch automatically makes the memory available to new allocations (but it keeps it in its cache).
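For example, a minimal demonstration of that caching behaviour, assuming a CUDA device is available (not related to the DeepSpeed code itself):

```python
import torch

# Allocate ~128MiB of fp32 on the GPU.
x = torch.empty(128 * 2**20 // 4, device="cuda")
print(torch.cuda.memory_allocated() // 2**20, "MiB allocated")
print(torch.cuda.memory_reserved() // 2**20, "MiB reserved")

# Dropping the last Python reference frees the tensor: memory_allocated()
# goes down, but memory_reserved() stays up because PyTorch keeps the block
# in its caching allocator for reuse by future allocations.
del x
print(torch.cuda.memory_allocated() // 2**20, "MiB allocated")
print(torch.cuda.memory_reserved() // 2**20, "MiB reserved")
```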
PyTorch can't be aware of how Python assigns to a variable. E.g. you can see here how even doing a+b+c creates a peak memory overhead: https://github.com/pytorch/pytorch/issues/27522#issuecomment-975041172
As I tried to briefly describe above, Python doesn't necessarily free a variable when it goes out of scope or even when it gets deleted explicitly, since other objects may still be referencing it. When the reference count drops to 0, the object and the memory it occupies are freed. The GC runs its collection cycle on its own schedule, at which point it identifies which objects can be freed and frees them. So at critical points one may have to call gc.collect() explicitly in addition to del. Though if the object is not referenced by any other object, del should be sufficient to free the memory.
In our situation of working with huge memory chunks, an explicit gc.collect call introduces no significant additional overhead.
There is an in-depth guide on Python GC if you're interested.
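To make the two points above concrete, a small self-contained illustration (CUDA device assumed, sizes arbitrary):

```python
import gc
import torch

MiB = 2**20
a = torch.ones(64 * MiB // 4, device="cuda")   # ~64MiB of fp32 each
b = torch.ones_like(a)
c = torch.ones_like(a)

# a + b + c first materializes a temporary for (a + b), so peak memory is
# briefly higher than what the final result d needs on its own.
torch.cuda.reset_peak_memory_stats()
d = a + b + c
print("peak during a+b+c:", torch.cuda.max_memory_allocated() // MiB, "MiB")

# Freeing relies on reference counting: once the last reference is gone, the
# memory goes straight back to PyTorch's caching allocator. gc.collect() only
# matters when reference cycles keep an object alive past its del.
del d
gc.collect()
print("allocated after del:", torch.cuda.memory_allocated() // MiB, "MiB")
```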
Sounds like this particular scenario might just be a corner case, but still needs to be confirmed. There might not be much we can do about it besides avoiding flying too close to the sun.