
Out of memory error


The English template (General question) is recommended, so that your question can help more people.

First, confirm the following:

  • I have searched the related issues, but did not find the help I needed.
  • I have read the relevant documentation, but still do not know how to solve this.

Describe the problem you encountered

I started a training task with this command:

srun -p kshdtest --gres=dcu:4 --ntasks=4 --ntasks-per-node=4 --kill-on-bad-exit=1 python -u tools/train.py configs/swin_transformer/swin_tiny_224_b16x64_300e_imagenet.py --launcher=slurm

After initialization, it reports an out of memory error.

2021-10-05 16:36:27,739 - mmcls - INFO - workflow: [('train', 1)], max: 300 epochs
slurmstepd: error: Detected 254 oom-kill event(s) in step 12272627.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: j17r3n18: task 3: Out Of Memory

But if I use configs/resnet/resnet18_b32x8_imagenet.py instead, training starts normally. Is there any problem with swin-tiny?


Related information

  1. The output of the pip list | grep "mmcv\|mmcls\|^torch" command:
mmcls              0.15.0
mmcv-full          1.3.8
torch              1.7.0a0
torchvision        0.8.0a0+132984f
  2. If you modified the config file, or used a new one, paste it here: configs/swin_transformer/swin_tiny_224_b16x64_300e_imagenet.py
_base_ = [                                                                                                                                                                                                            
    '../_base_/models/swin_transformer/tiny_224.py',
    '../_base_/datasets/imagenet_bs64_swin_224.py',
    '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
    '../_base_/default_runtime.py'
]
evaluation = dict(interval=25)
checkpoint_config = dict(interval=25)
  3. If the problem occurred during training, paste the complete training log and error message:
slurmstepd: error: Detected 263 oom-kill event(s) in step 12273709.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: j17r3n18: task 1: Out Of Memory
srun: Terminating job step 12273709.0
slurmstepd: error: *** STEP 12273709.0 ON j17r3n18 CANCELLED AT 2021-10-05T19:50:34 ***
  4. If you made other related modifications to the code under the mmcls folder, describe them here: mmcls/datasets/builder.py
# line 54: stop dataloader worker processes from persisting across epochs
persistent_workers=False,

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

1 reaction
mzr1996 commented, Oct 9, 2021

Hi @mzr1996 Thanks! It works for me.

Glad to hear that, but there are some other modifications you need to make, because the batch size and the learning rate should match. Here are two ways to achieve that.

  1. Reduce the learning rate at the same time.
_base_ = [
    '../_base_/models/swin_transformer/tiny_224.py',
    '../_base_/datasets/imagenet_bs64_swin_224.py',
    '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
    '../_base_/default_runtime.py'
]
data = dict(samples_per_gpu=32)
optimizer = dict(lr=5e-4 * 32 * 4 / 512)   # <-- For samples_per_gpu=32 and 4 GPUs.
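As a sanity check of the linear scaling rule behind that lr line, here is a small plain-Python sketch (not mmcls code; the 5e-4 base learning rate and the 512 reference batch size are taken from the expression above):

```python
# Linear scaling rule: keep the learning rate proportional to the total
# batch size, using 5e-4 at a reference total batch size of 512.

def scaled_lr(base_lr, base_batch, samples_per_gpu, num_gpus):
    """Scale the learning rate linearly with the total batch size."""
    total_batch = samples_per_gpu * num_gpus
    return base_lr * total_batch / base_batch

# 4 GPUs with samples_per_gpu=32 -> total batch size 128
print(scaled_lr(5e-4, 512, 32, 4))   # 0.000125

# The original config (16 GPUs x 64 samples per GPU) recovers lr = 1e-3
print(scaled_lr(5e-4, 512, 64, 16))  # 0.001
```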
  2. Use GradientCumulativeOptimizerHook; see https://github.com/open-mmlab/mmcv/pull/1221
_base_ = [
    '../_base_/models/swin_transformer/tiny_224.py',
    '../_base_/datasets/imagenet_bs64_swin_224.py',
    '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
    '../_base_/default_runtime.py'
]
data = dict(samples_per_gpu=32)
# Our original batch size is 16 * 64, here you use 4 * 32, so the cumulative_iters=8
optimizer_config = dict(type='GradientCumulativeOptimizerHook', cumulative_iters=8)

The second method keeps the original training config almost unchanged, but it is a new feature, which means it may not be stable.
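To see why cumulative_iters=8 restores the original effective batch size, here is a framework-free sketch of gradient accumulation (illustrative numbers only; the real hook divides each loss by cumulative_iters before backpropagation, which amounts to averaging the micro-batch gradients):

```python
# Gradient accumulation: average the gradients of `cumulative_iters`
# micro-batches, then apply a single optimizer update, so 8 steps at
# batch size 4 GPUs x 32 samples = 128 act like one step at the
# original batch size of 1024.

def accumulated_sgd_step(param, micro_grads, lr):
    """One SGD update from the averaged micro-batch gradients."""
    avg_grad = sum(micro_grads) / len(micro_grads)
    return param - lr * avg_grad

# Gradients from 8 micro-batches (made-up values averaging to 1.0)
micro_grads = [0.8, 1.2, 1.0, 0.9, 1.1, 1.0, 0.95, 1.05]
print(accumulated_sgd_step(2.0, micro_grads, lr=0.1))  # 1.9
```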

0 reactions
xiexinch commented, Oct 9, 2021

Thanks! I will give it a try.
