Out of memory error
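We recommend using the English template "General question" so that your question can help more people.
First, confirm the following
- I have searched the related issues but did not find the help I needed.
- I have read the related documentation but still do not know how to solve the problem.
Describe the problem you encountered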
I started a training task with this command:
srun -p kshdtest --gres=dcu:4 --ntasks=4 --ntasks-per-node=4 --kill-on-bad-exit=1 python -u tools/train.py configs/swin_transformer/swin_tiny_224_b16x64_300e_imagenet.py --launcher=slurm
After initialization, it reports an out-of-memory error:
2021-10-05 16:36:27,739 - mmcls - INFO - workflow: [('train', 1)], max: 300 epochs
slurmstepd: error: Detected 254 oom-kill event(s) in step 12272627.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: j17r3n18: task 3: Out Of Memory
But if I use configs/resnet/resnet18_b32x8_imagenet.py instead, training starts normally.
Is there a problem with swin-tiny?
Related information
- Output of the command pip list | grep "mmcv\|mmcls\|^torch":
mmcls 0.15.0
mmcv-full 1.3.8
torch 1.7.0a0
torchvision 0.8.0a0+132984f
- If you modified a config file or used a new one, please specify it here: configs/swin_transformer/swin_tiny_224_b16x64_300e_imagenet.py
_base_ = [
    '../_base_/models/swin_transformer/tiny_224.py',
    '../_base_/datasets/imagenet_bs64_swin_224.py',
    '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
    '../_base_/default_runtime.py'
]
evaluation = dict(interval=25)
checkpoint_config = dict(interval=25)
- If you encountered the problem during training, please provide the complete training log and error message:
slurmstepd: error: Detected 263 oom-kill event(s) in step 12273709.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: j17r3n18: task 1: Out Of Memory
srun: Terminating job step 12273709.0
slurmstepd: error: *** STEP 12273709.0 ON j17r3n18 CANCELLED AT 2021-10-05T19:50:34 ***
- If you made other related modifications to the code under the mmcls folder, please specify them here: mmcls/datasets/builder.py
# line 54
persistent_workers=False,
Issue Analytics
- Created 2 years ago
- Comments:5 (2 by maintainers)
Glad to hear that, but there are some other modifications you need to make, because the batch size and the learning rate should match. Here are two ways to achieve that.
GradientCumulativeOptimizerHook: refer to https://github.com/open-mmlab/mmcv/pull/1221
The second method keeps almost all of the original training config, but it is a new feature, which means it may not be stable.
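As a rough illustration of keeping batch size and learning rate matched: the config name (b16x64) and the schedule name (bs1024) suggest 16 GPUs x 64 samples = 1024 samples per optimizer step, while the command above launches only 4 tasks. The sketch below computes either a linearly scaled learning rate or the number of accumulation steps that restores the 1024 effective batch size. The helper names, the linear scaling rule, and the example base learning rate are assumptions for illustration and do not come from the thread; `cumulative_iters` is the accumulation parameter of `GradientCumulativeOptimizerHook` as I understand it, so check it against your installed mmcv version.

```python
# Minimal sketch, assuming the schedule was tuned for 1024 samples per step
# (inferred from the name imagenet_bs1024_adamw_swin).
TARGET_BATCH_SIZE = 1024


def scaled_lr(base_lr, samples_per_gpu, num_gpus, base_batch_size=TARGET_BATCH_SIZE):
    """Hypothetical helper: linear scaling rule (an assumption, not from the thread)."""
    return base_lr * samples_per_gpu * num_gpus / base_batch_size


def cumulative_iters_for(samples_per_gpu, num_gpus, target=TARGET_BATCH_SIZE):
    """Hypothetical helper: accumulation steps needed to reach the target batch size."""
    per_step = samples_per_gpu * num_gpus
    if target % per_step != 0:
        raise ValueError(f"{target} is not a multiple of {per_step}")
    return target // per_step


# Option A: shrink the learning rate to match the smaller batch.
# base_lr=0.001 is only an example value; use the one from your schedule file.
lr_for_4_gpus = scaled_lr(base_lr=0.001, samples_per_gpu=64, num_gpus=4)

# Option B: keep the learning rate and accumulate gradients instead.
# 4 GPUs x 64 samples = 256 per step, so 4 steps give 1024.
optimizer_config = dict(
    type='GradientCumulativeOptimizerHook',
    cumulative_iters=cumulative_iters_for(samples_per_gpu=64, num_gpus=4),
)
```

With option B the rest of the training config can stay as it is, which matches the remark above that the hook-based method keeps almost all of the original config.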
Thanks! I will give it a try.