
OSError: [Errno 12] Cannot allocate memory

See original GitHub issue

```
(open-mmlab_ldh5) ➜ mmdetection git:(master) ✗ CUDA_VISIBLE_DEVICES=4,5,6,7 ./tools/dist_train.sh ./configs/rpc/faster_rcnn_r50_fpn_1x.py 4 --validate
2019-05-24 20:08:24,708 - INFO - Distributed training: True
2019-05-24 20:08:25,313 - INFO - load model from: modelzoo://resnet50
2019-05-24 20:08:25,611 - WARNING - unexpected key in source state_dict: fc.weight, fc.bias

missing keys in source state_dict: layer2.2.bn1.num_batches_tracked, layer2.2.bn3.num_batches_tracked, layer4.0.bn3.num_batches_tracked, layer1.0.bn3.num_batches_tracked, layer3.0.bn1.num_batches_tracked, layer4.1.bn1.num_batches_tracked, layer2.0.downsample.1.num_batches_tracked, layer1.1.bn2.num_batches_tracked, layer2.1.bn1.num_batches_tracked, layer3.1.bn3.num_batches_tracked, layer2.2.bn2.num_batches_tracked, layer1.2.bn3.num_batches_tracked, layer4.2.bn1.num_batches_tracked, layer3.5.bn3.num_batches_tracked, layer2.3.bn1.num_batches_tracked, layer4.0.bn1.num_batches_tracked, layer3.0.bn3.num_batches_tracked, layer3.2.bn3.num_batches_tracked, layer3.4.bn1.num_batches_tracked, layer4.0.downsample.1.num_batches_tracked, layer3.4.bn3.num_batches_tracked, layer3.0.bn2.num_batches_tracked, layer2.1.bn2.num_batches_tracked, layer1.0.bn2.num_batches_tracked, layer2.3.bn2.num_batches_tracked, layer1.1.bn3.num_batches_tracked, layer3.1.bn1.num_batches_tracked, layer2.0.bn3.num_batches_tracked, layer2.0.bn1.num_batches_tracked, layer3.4.bn2.num_batches_tracked, layer2.0.bn2.num_batches_tracked, layer3.3.bn2.num_batches_tracked, layer1.0.bn1.num_batches_tracked, layer1.2.bn2.num_batches_tracked, layer2.3.bn3.num_batches_tracked, layer3.3.bn3.num_batches_tracked, layer3.1.bn2.num_batches_tracked, bn1.num_batches_tracked, layer4.2.bn2.num_batches_tracked, layer3.5.bn2.num_batches_tracked, layer3.3.bn1.num_batches_tracked, layer3.5.bn1.num_batches_tracked, layer4.2.bn3.num_batches_tracked, layer4.1.bn3.num_batches_tracked, layer4.0.bn2.num_batches_tracked, layer1.1.bn1.num_batches_tracked, layer3.0.downsample.1.num_batches_tracked, layer4.1.bn2.num_batches_tracked, layer3.2.bn2.num_batches_tracked, layer1.2.bn1.num_batches_tracked, layer1.0.downsample.1.num_batches_tracked, layer2.1.bn3.num_batches_tracked, layer3.2.bn1.num_batches_tracked

loading annotations into memory… loading annotations into memory… loading annotations into memory… loading annotations into memory…
Done (t=202.67s) creating index… index created!
Done (t=254.98s) creating index… index created!
Done (t=278.15s) creating index…
Done (t=279.31s) creating index… index created! index created!
loading annotations into memory… loading annotations into memory… loading annotations into memory… loading annotations into memory…
Done (t=1.17s) creating index… index created!
Done (t=1.26s) creating index… index created!
Done (t=1.36s) creating index… index created!
Done (t=1.82s) creating index… index created!
2019-05-24 20:13:14,064 - INFO - Start running, host: ices@ices-SYS-4028GR-TR, work_dir: /home/ices/andrewjyz/Projects/detection/2019-5-23-18-56/mmdetection/work_dirs/faster_rcnn_r50_fpn_1x
2019-05-24 20:13:14,065 - INFO - workflow: [('train', 1)], max: 12 epochs
Traceback (most recent call last):
  File "./tools/train.py", line 95, in <module>
    main()
  File "./tools/train.py", line 91, in main
    logger=logger)
  File "/home/ices/andrewjyz/Projects/detection/2019-5-23-18-56/mmdetection/mmdet/apis/train.py", line 59, in train_detector
    _dist_train(model, dataset, cfg, validate=validate)
  File "/home/ices/andrewjyz/Projects/detection/2019-5-23-18-56/mmdetection/mmdet/apis/train.py", line 171, in _dist_train
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/home/ices/andrewjyz/miniconda3/envs/open-mmlab_ldh5/lib/python3.7/site-packages/mmcv/runner/runner.py", line 356, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/ices/andrewjyz/miniconda3/envs/open-mmlab_ldh5/lib/python3.7/site-packages/mmcv/runner/runner.py", line 258, in train
    for i, data_batch in enumerate(data_loader):
  File "/home/ices/andrewjyz/miniconda3/envs/open-mmlab_ldh5/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 193, in __iter__
    return _DataLoaderIter(self)
  File "/home/ices/andrewjyz/miniconda3/envs/open-mmlab_ldh5/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 469, in __init__
    w.start()
  File "/home/ices/andrewjyz/miniconda3/envs/open-mmlab_ldh5/lib/python3.7/multiprocessing/process.py", line 112, in start
    self._popen = self._Popen(self)
  File "/home/ices/andrewjyz/miniconda3/envs/open-mmlab_ldh5/lib/python3.7/multiprocessing/context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/home/ices/andrewjyz/miniconda3/envs/open-mmlab_ldh5/lib/python3.7/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/home/ices/andrewjyz/miniconda3/envs/open-mmlab_ldh5/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/home/ices/andrewjyz/miniconda3/envs/open-mmlab_ldh5/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/home/ices/andrewjyz/miniconda3/envs/open-mmlab_ldh5/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 59, in _launch
    cmd, self._fds)
  File "/home/ices/andrewjyz/miniconda3/envs/open-mmlab_ldh5/lib/python3.7/multiprocessing/util.py", line 420, in spawnv_passfds
    False, False, None)
OSError: [Errno 12] Cannot allocate memory
```

My dataset is in COCO format, and the JSON file includes "segmentation" data. The training JSON file is 7.0 GB and there are 100,000 images (image size 1851×1851). When I start training, the dataset cannot be loaded and the error above appears.

My server has 252 GB of RAM. The GPUs are GeForce GTX 1080 Ti cards, each with 11178 MiB of memory. Is all of the data loaded into memory at once during training? If the data is too large, how should I train?

I hope someone can help me solve the problem, thanks.
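For context on where this fails: the traceback dies in `w.start()`, i.e. while the DataLoader is launching its worker processes, at a point where four training processes already hold the parsed 7.0 GB annotation file. Besides adding swap space (see the comments below), a common mitigation is to reduce the number of DataLoader workers per GPU in the config. The snippet below is only a sketch of the relevant `data` section in an mmdetection-v1-era config; the field names, paths, and values are assumptions based on configs of that generation, not values taken from this issue:

```python
# Hypothetical sketch of the `data` section of an mmdetection-v1-style config
# (e.g. configs/rpc/faster_rcnn_r50_fpn_1x.py). Paths and dataset layout are
# placeholders; workers_per_gpu is the knob being illustrated.
dataset_type = 'CocoDataset'
data_root = 'data/rpc/'          # assumed dataset location

data = dict(
    imgs_per_gpu=2,              # batch size per GPU
    workers_per_gpu=1,           # fewer worker processes per GPU -> lower peak host memory
    train=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/instances_train.json',
        img_prefix=data_root + 'train/',
    ),
    val=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/instances_val.json',
        img_prefix=data_root + 'val/',
    ),
)
```

Setting `workers_per_gpu=0` should keep data loading entirely in the main process (PyTorch's `num_workers=0` mode); it costs throughput, but it is a quick way to confirm that worker start-up is what triggers the ENOMEM.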

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 7 (1 by maintainers)

Top GitHub Comments

4 reactions
mzk665 commented, Sep 24, 2020


I found the same problem and solved it by expanding the swap partition size.

Step-by-step solution:

  1. Create a swap file of the desired size, e.g. 2 GB: `dd if=/dev/zero of=/var/swap bs=1024 count=2048000`

  2. Set it up as swap space: `mkswap /var/swap`

  3. Activate the swap file: `swapon /var/swap` (a quick check that it worked is sketched after this comment)

Good luck!
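As a quick sanity check before re-launching training, the swap added above should show up in /proc/meminfo. Below is a minimal, dependency-free Python sketch (Linux only; the field names are standard /proc/meminfo keys, nothing from this thread):

```python
# Minimal sketch: confirm the swap file created above is active by reading
# /proc/meminfo (Linux only, no third-party dependencies).

def read_meminfo(path="/proc/meminfo"):
    """Return /proc/meminfo as a dict of {field: size in KiB}."""
    info = {}
    with open(path) as f:
        for line in f:
            key, rest = line.split(":", 1)
            info[key.strip()] = int(rest.strip().split()[0])  # values are reported in kB
    return info

if __name__ == "__main__":
    m = read_meminfo()
    for key in ("MemTotal", "MemAvailable", "SwapTotal", "SwapFree",
                "CommitLimit", "Committed_AS"):
        print(f"{key:>13}: {m.get(key, 0) / 1024:10.0f} MiB")
```

If `SwapTotal` stays at 0 after running `swapon`, the swap file was not actually activated, which is exactly what the next comment runs into inside Docker.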

0 reactions
AkihiroSasabe commented, Oct 30, 2020

@mzk665 Thanks for providing this solution.

I cannot activate the swap partition. The error message is as follows:

```
root@mmdetection20200628:/mmdetection# swapon /var/swap
swapon: /var/swap: swapon failed: Operation not permitted
```

Perhaps the Docker settings are the reason I can't activate the file. Are you working in a Docker environment? Do you know how to solve this problem?

Read more comments on GitHub >

