
Out of memory error


The English template (General question) is recommended, so that your question can help more people.

First, confirm the following:

  • I have searched the related issues, but did not find the help I needed.
  • I have read the relevant documentation, but still do not know how to solve this.

Describe the problem you encountered

I started a training task with this command:

srun -p kshdtest --gres=dcu:4 --ntasks=4 --ntasks-per-node=4 --kill-on-bad-exit=1 python -u tools/train.py configs/swin_transformer/swin_tiny_224_b16x64_300e_imagenet.py --launcher=slurm

After initialization, it reports an out of memory error.

2021-10-05 16:36:27,739 - mmcls - INFO - workflow: [('train', 1)], max: 300 epochs
slurmstepd: error: Detected 254 oom-kill event(s) in step 12272627.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: j17r3n18: task 3: Out Of Memory

But if I use configs/resnet/resnet18_b32x8_imagenet.py instead, training starts normally. Is there any problem with swin-tiny?


Related information

  1. The output of the pip list | grep "mmcv\|mmcls\|^torch" command:
mmcls              0.15.0
mmcv-full          1.3.8
torch              1.7.0a0
torchvision        0.8.0a0+132984f
  2. If you modified the config file, or used a new one, paste it here: configs/swin_transformer/swin_tiny_224_b16x64_300e_imagenet.py
_base_ = [                                                                                                                                                                                                            
    '../_base_/models/swin_transformer/tiny_224.py',
    '../_base_/datasets/imagenet_bs64_swin_224.py',
    '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
    '../_base_/default_runtime.py'
]
evaluation = dict(interval=25)
checkpoint_config = dict(interval=25)
  3. If the problem occurred during training, paste the complete training log and error message:
slurmstepd: error: Detected 263 oom-kill event(s) in step 12273709.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: j17r3n18: task 1: Out Of Memory
srun: Terminating job step 12273709.0
slurmstepd: error: *** STEP 12273709.0 ON j17r3n18 CANCELLED AT 2021-10-05T19:50:34 ***
  4. If you made other related modifications to the code under the mmcls folder, describe them here: mmcls/datasets/builder.py
# line 54: stop dataloader worker processes from persisting across epochs
persistent_workers=False,

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

1 reaction
mzr1996 commented, Oct 9, 2021

Hi @mzr1996 Thanks! It works for me.

Glad to hear that, but there are some other modifications you need to make, because the batch size and the learning rate should match. Here are two ways to achieve that.

  1. Reduce the learning rate at the same time.
_base_ = [
    '../_base_/models/swin_transformer/tiny_224.py',
    '../_base_/datasets/imagenet_bs64_swin_224.py',
    '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
    '../_base_/default_runtime.py'
]
data = dict(samples_per_gpu=32)
optimizer = dict(lr=5e-4 * 32 * 4 / 512)   # <-- For samples_per_gpu=32 and 4 GPUs.
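As a sanity check of the linear scaling rule behind that lr line, here is a small plain-Python sketch (not mmcls code; the 5e-4 base learning rate and the 512 reference batch size are taken from the expression above):

```python
# Linear scaling rule: keep the learning rate proportional to the total
# batch size, using 5e-4 at a reference total batch size of 512.

def scaled_lr(base_lr, base_batch, samples_per_gpu, num_gpus):
    """Scale the learning rate linearly with the total batch size."""
    total_batch = samples_per_gpu * num_gpus
    return base_lr * total_batch / base_batch

# 4 GPUs with samples_per_gpu=32 -> total batch size 128
print(scaled_lr(5e-4, 512, 32, 4))   # 0.000125

# The original config (16 GPUs x 64 samples per GPU) recovers lr = 1e-3
print(scaled_lr(5e-4, 512, 64, 16))  # 0.001
```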
  2. Use GradientCumulativeOptimizerHook; see https://github.com/open-mmlab/mmcv/pull/1221
_base_ = [
    '../_base_/models/swin_transformer/tiny_224.py',
    '../_base_/datasets/imagenet_bs64_swin_224.py',
    '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
    '../_base_/default_runtime.py'
]
data = dict(samples_per_gpu=32)
# Our original batch size is 16 * 64, here you use 4 * 32, so the cumulative_iters=8
optimizer_config = dict(type='GradientCumulativeOptimizerHook', cumulative_iters=8)

The second method keeps the original training config almost unchanged, but it is a new feature, which means it may not be stable.
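To see why cumulative_iters=8 restores the original effective batch size, here is a framework-free sketch of gradient accumulation (illustrative numbers only; the real hook divides each loss by cumulative_iters before backpropagation, which amounts to averaging the micro-batch gradients):

```python
# Gradient accumulation: average the gradients of `cumulative_iters`
# micro-batches, then apply a single optimizer update, so 8 steps at
# batch size 4 GPUs x 32 samples = 128 act like one step at the
# original batch size of 1024.

def accumulated_sgd_step(param, micro_grads, lr):
    """One SGD update from the averaged micro-batch gradients."""
    avg_grad = sum(micro_grads) / len(micro_grads)
    return param - lr * avg_grad

# Gradients from 8 micro-batches (made-up values averaging to 1.0)
micro_grads = [0.8, 1.2, 1.0, 0.9, 1.1, 1.0, 0.95, 1.05]
print(accumulated_sgd_step(2.0, micro_grads, lr=0.1))  # 1.9
```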

0 reactions
xiexinch commented, Oct 9, 2021

Thanks! I will give it a try.
