CUDA: out of memory error
Bug

Hi,
I have seen your discussions about the 'GPU out of memory' error, but there is no concrete outcome/solution to it in Issue-188.
I am trying to train a custom ID card detection dataset for segmentation purposes. Below are the specs:
- Ubuntu 18.04
- NVIDIA graphics: GeForce 840M, 2 GB, built in

I do not have any extra graphics card, as the project is at the proof-of-concept stage. Initially I faced a few problems installing mmdetection locally. Although it is now installed, I am facing problems training the custom dataset, with the error:
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 1.96 GiB total capacity; 1.22 GiB already allocated; 19.00 MiB free; 39.15 MiB cached)
The resolution in Issue-188 was to change from GPU to CPU, but transform.py has been deleted. How should I go about this?
Reproduction
!python mmdetection/tools/train.py {config_fname}
I am currently using an ID card dataset with around 50 images.
Environment
- OS: Ubuntu 18.04
- GCC: 7.4.0
- PyTorch version: 1.3.0
- PyTorch installed via: conda
- GPU model: GeForce 840M (800M Series, Notebook)
- CUDA: 10.0
!python mmdetection/tools/train.py {config_fname}
/home/nqe00239/projects/mmdetection_instance_segmentation_demo
2019-10-24 15:04:19,661 - INFO - Distributed training: False
2019-10-24 15:04:20,037 - INFO - load model from: torchvision://resnet50
2019-10-24 15:04:20,229 - WARNING - The model and loaded state dict do not match exactly
unexpected key in source state_dict: fc.weight, fc.bias
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
2019-10-24 15:04:23,544 - INFO - Start running, host: nqe00239@nqe00239-Latitude-E7450, work_dir: /home/nqe00239/projects/mmdetection_instance_segmentation_demo/work_dirs/mask_rcnn_r50_fpn_1x
2019-10-24 15:04:23,544 - INFO - workflow: [('train', 1)], max: 20 epochs
Traceback (most recent call last):
File "mmdetection/tools/train.py", line 108, in <module>
main()
File "mmdetection/tools/train.py", line 104, in main
logger=logger)
File "/home/nqe00239/anaconda3/envs/conda_jupyter_envs/lib/python3.7/site-packages/mmdet-1.0rc0+unknown-py3.7-linux-x86_64.egg/mmdet/apis/train.py", line 60, in train_detector
_non_dist_train(model, dataset, cfg, validate=validate)
File "/home/nqe00239/anaconda3/envs/conda_jupyter_envs/lib/python3.7/site-packages/mmdet-1.0rc0+unknown-py3.7-linux-x86_64.egg/mmdet/apis/train.py", line 221, in _non_dist_train
runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
File "/home/nqe00239/anaconda3/envs/conda_jupyter_envs/lib/python3.7/site-packages/mmcv/runner/runner.py", line 358, in run
epoch_runner(data_loaders[i], **kwargs)
File "/home/nqe00239/anaconda3/envs/conda_jupyter_envs/lib/python3.7/site-packages/mmcv/runner/runner.py", line 264, in train
self.model, data_batch, train_mode=True, **kwargs)
File "/home/nqe00239/anaconda3/envs/conda_jupyter_envs/lib/python3.7/site-packages/mmdet-1.0rc0+unknown-py3.7-linux-x86_64.egg/mmdet/apis/train.py", line 38, in batch_processor
losses = model(**data)
File "/home/nqe00239/anaconda3/envs/conda_jupyter_envs/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/home/nqe00239/anaconda3/envs/conda_jupyter_envs/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
return self.module(*inputs[0], **kwargs[0])
File "/home/nqe00239/anaconda3/envs/conda_jupyter_envs/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/home/nqe00239/anaconda3/envs/conda_jupyter_envs/lib/python3.7/site-packages/mmdet-1.0rc0+unknown-py3.7-linux-x86_64.egg/mmdet/core/fp16/decorators.py", line 49, in new_func
return old_func(*args, **kwargs)
File "/home/nqe00239/anaconda3/envs/conda_jupyter_envs/lib/python3.7/site-packages/mmdet-1.0rc0+unknown-py3.7-linux-x86_64.egg/mmdet/models/detectors/base.py", line 86, in forward
return self.forward_train(img, img_meta, **kwargs)
File "/home/nqe00239/anaconda3/envs/conda_jupyter_envs/lib/python3.7/site-packages/mmdet-1.0rc0+unknown-py3.7-linux-x86_64.egg/mmdet/models/detectors/two_stage.py", line 211, in forward_train
mask_pred = self.mask_head(mask_feats)
File "/home/nqe00239/anaconda3/envs/conda_jupyter_envs/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/home/nqe00239/anaconda3/envs/conda_jupyter_envs/lib/python3.7/site-packages/mmdet-1.0rc0+unknown-py3.7-linux-x86_64.egg/mmdet/core/fp16/decorators.py", line 49, in new_func
return old_func(*args, **kwargs)
File "/home/nqe00239/anaconda3/envs/conda_jupyter_envs/lib/python3.7/site-packages/mmdet-1.0rc0+unknown-py3.7-linux-x86_64.egg/mmdet/models/mask_heads/fcn_mask_head.py", line 99, in forward
x = self.upsample(x)
File "/home/nqe00239/anaconda3/envs/conda_jupyter_envs/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/home/nqe00239/anaconda3/envs/conda_jupyter_envs/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 778, in forward
output_padding, self.groups, self.dilation)
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 1.96 GiB total capacity; 1.22 GiB already allocated; 19.00 MiB free; 39.15 MiB cached)
Top GitHub Comments
Sometimes, the CUDA out of memory error happens because a training image contains too many bboxes (>500). That was also my case, and it was not solved by using sample_per_gpu=1.
So, if this is also your case, the fix is basically to do the bbox assignment entirely on the CPU rather than on the GPU when a single image contains a large number of bboxes (see the sketch below).
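A minimal sketch of what such a change might look like in an mmdetection config, assuming your mmdetection version supports the gpu_assign_thr option of MaxIoUAssigner (the exact steps from the original comment are not reproduced here; only the RPN assigner block is shown, and the threshold value is hypothetical):

```python
# Hypothetical excerpt of a Mask R-CNN train_cfg; only the RPN assigner is shown.
# gpu_assign_thr is assumed to be available in your mmdet version: when an image
# has more ground-truth boxes than the threshold, MaxIoUAssigner computes the
# IoU matrix and the assignment on the CPU instead of the GPU.
train_cfg = dict(
    rpn=dict(
        assigner=dict(
            type='MaxIoUAssigner',
            pos_iou_thr=0.7,
            neg_iou_thr=0.3,
            min_pos_iou=0.3,
            ignore_iof_thr=-1,
            gpu_assign_thr=500)))  # fall back to CPU above 500 GT boxes per image
```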
This is because the GPU memory is indeed a little bit small. You can use a smaller image size (e.g., 512x512), set img_per_gpu=1, and see whether it works.
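A sketch of the corresponding config changes, assuming the mmdetection 1.x config layout with imgs_per_gpu and a Resize step in the training pipeline (dataset paths are placeholders, not the reporter's actual paths):

```python
# Hypothetical data section of an mmdetection 1.x config for the ID card dataset.
data = dict(
    imgs_per_gpu=1,     # one image per GPU to lower peak memory
    workers_per_gpu=1,
    train=dict(
        type='CocoDataset',
        ann_file='data/idcard/annotations/train.json',  # placeholder path
        img_prefix='data/idcard/train/',                # placeholder path
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(type='LoadAnnotations', with_bbox=True, with_mask=True),
            dict(type='Resize', img_scale=(512, 512), keep_ratio=True),  # smaller input size
            dict(type='RandomFlip', flip_ratio=0.5),
            dict(type='Normalize',
                 mean=[123.675, 116.28, 103.53],
                 std=[58.395, 57.12, 57.375],
                 to_rgb=True),
            dict(type='Pad', size_divisor=32),
            dict(type='DefaultFormatBundle'),
            dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels', 'gt_masks']),
        ]))
```

Halving the input resolution roughly quarters the activation memory of the backbone and FPN, which is usually the biggest lever on a 2 GB card.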