CUDA: out of memory error
Bug

Hi,
I have seen your discussions about the 'GPU out of memory' error, but there is no concrete outcome/solution to it in Issue-188.
I am trying to train a custom ID card detection dataset for segmentation purposes. Below are the specs:
- Ubuntu 18.04
- NVIDIA graphics: GeForce 840M, 2 GB, built in

I do not have any extra graphics card, as the project is at the proof-of-concept stage. Initially I faced a few problems installing mmdetection locally. Although it is now installed, I am facing problems training the custom dataset, with the error:
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 1.96 GiB total capacity; 1.22 GiB already allocated; 19.00 MiB free; 39.15 MiB cached)
The resolution in Issue-188 was to change from GPU to CPU, but transform.py has been deleted. How should I go about this?
Reproduction
!python mmdetection/tools/train.py {config_fname}
I am currently using an ID card dataset with around 50 images.
Environment
- OS: Ubuntu 18.04
- GCC: 7.4.0
- PyTorch version: 1.3.0
- PyTorch installed via: conda
- GPU model: GeForce 840M (800M Series, Notebook)
- CUDA: 10.0
!python mmdetection/tools/train.py {config_fname}
/home/nqe00239/projects/mmdetection_instance_segmentation_demo
2019-10-24 15:04:19,661 - INFO - Distributed training: False
2019-10-24 15:04:20,037 - INFO - load model from: torchvision://resnet50
2019-10-24 15:04:20,229 - WARNING - The model and loaded state dict do not match exactly
unexpected key in source state_dict: fc.weight, fc.bias
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
2019-10-24 15:04:23,544 - INFO - Start running, host: nqe00239@nqe00239-Latitude-E7450, work_dir: /home/nqe00239/projects/mmdetection_instance_segmentation_demo/work_dirs/mask_rcnn_r50_fpn_1x
2019-10-24 15:04:23,544 - INFO - workflow: [('train', 1)], max: 20 epochs
Traceback (most recent call last):
File "mmdetection/tools/train.py", line 108, in <module>
main()
File "mmdetection/tools/train.py", line 104, in main
logger=logger)
File "/home/nqe00239/anaconda3/envs/conda_jupyter_envs/lib/python3.7/site-packages/mmdet-1.0rc0+unknown-py3.7-linux-x86_64.egg/mmdet/apis/train.py", line 60, in train_detector
_non_dist_train(model, dataset, cfg, validate=validate)
File "/home/nqe00239/anaconda3/envs/conda_jupyter_envs/lib/python3.7/site-packages/mmdet-1.0rc0+unknown-py3.7-linux-x86_64.egg/mmdet/apis/train.py", line 221, in _non_dist_train
runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
File "/home/nqe00239/anaconda3/envs/conda_jupyter_envs/lib/python3.7/site-packages/mmcv/runner/runner.py", line 358, in run
epoch_runner(data_loaders[i], **kwargs)
File "/home/nqe00239/anaconda3/envs/conda_jupyter_envs/lib/python3.7/site-packages/mmcv/runner/runner.py", line 264, in train
self.model, data_batch, train_mode=True, **kwargs)
File "/home/nqe00239/anaconda3/envs/conda_jupyter_envs/lib/python3.7/site-packages/mmdet-1.0rc0+unknown-py3.7-linux-x86_64.egg/mmdet/apis/train.py", line 38, in batch_processor
losses = model(**data)
File "/home/nqe00239/anaconda3/envs/conda_jupyter_envs/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/home/nqe00239/anaconda3/envs/conda_jupyter_envs/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
return self.module(*inputs[0], **kwargs[0])
File "/home/nqe00239/anaconda3/envs/conda_jupyter_envs/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/home/nqe00239/anaconda3/envs/conda_jupyter_envs/lib/python3.7/site-packages/mmdet-1.0rc0+unknown-py3.7-linux-x86_64.egg/mmdet/core/fp16/decorators.py", line 49, in new_func
return old_func(*args, **kwargs)
File "/home/nqe00239/anaconda3/envs/conda_jupyter_envs/lib/python3.7/site-packages/mmdet-1.0rc0+unknown-py3.7-linux-x86_64.egg/mmdet/models/detectors/base.py", line 86, in forward
return self.forward_train(img, img_meta, **kwargs)
File "/home/nqe00239/anaconda3/envs/conda_jupyter_envs/lib/python3.7/site-packages/mmdet-1.0rc0+unknown-py3.7-linux-x86_64.egg/mmdet/models/detectors/two_stage.py", line 211, in forward_train
mask_pred = self.mask_head(mask_feats)
File "/home/nqe00239/anaconda3/envs/conda_jupyter_envs/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/home/nqe00239/anaconda3/envs/conda_jupyter_envs/lib/python3.7/site-packages/mmdet-1.0rc0+unknown-py3.7-linux-x86_64.egg/mmdet/core/fp16/decorators.py", line 49, in new_func
return old_func(*args, **kwargs)
File "/home/nqe00239/anaconda3/envs/conda_jupyter_envs/lib/python3.7/site-packages/mmdet-1.0rc0+unknown-py3.7-linux-x86_64.egg/mmdet/models/mask_heads/fcn_mask_head.py", line 99, in forward
x = self.upsample(x)
File "/home/nqe00239/anaconda3/envs/conda_jupyter_envs/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/home/nqe00239/anaconda3/envs/conda_jupyter_envs/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 778, in forward
output_padding, self.groups, self.dilation)
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 1.96 GiB total capacity; 1.22 GiB already allocated; 19.00 MiB free; 39.15 MiB cached)
Top GitHub Comments
Sometimes, the CUDA out of memory error happens because a training image contains too many bboxes (>500). That was also my case, and it was not solved by using sample_per_gpu=1.
So, if this is also your case, the fix is basically to do the bbox assignment entirely on the CPU rather than on the GPU when a single image contains a large number of bboxes (see the sketch below).
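A minimal sketch of what such a change might look like in an mmdetection config, assuming your mmdetection version supports the gpu_assign_thr option of MaxIoUAssigner (the exact steps from the original comment are not reproduced here; only the RPN assigner block is shown, and the threshold value is hypothetical):

```python
# Hypothetical excerpt of a Mask R-CNN train_cfg; only the RPN assigner is shown.
# gpu_assign_thr is assumed to be available in your mmdet version: when an image
# has more ground-truth boxes than the threshold, MaxIoUAssigner computes the
# IoU matrix and the assignment on the CPU instead of the GPU.
train_cfg = dict(
    rpn=dict(
        assigner=dict(
            type='MaxIoUAssigner',
            pos_iou_thr=0.7,
            neg_iou_thr=0.3,
            min_pos_iou=0.3,
            ignore_iof_thr=-1,
            gpu_assign_thr=500)))  # fall back to CPU above 500 GT boxes per image
```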
This is because the GPU memory is indeed a little bit small. You can use a smaller image size (e.g., 512x512), set img_per_gpu=1, and see whether it works.
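A sketch of the corresponding config changes, assuming the mmdetection 1.x config layout with imgs_per_gpu and a Resize step in the training pipeline (dataset paths are placeholders, not the reporter's actual paths):

```python
# Hypothetical data section of an mmdetection 1.x config for the ID card dataset.
data = dict(
    imgs_per_gpu=1,     # one image per GPU to lower peak memory
    workers_per_gpu=1,
    train=dict(
        type='CocoDataset',
        ann_file='data/idcard/annotations/train.json',  # placeholder path
        img_prefix='data/idcard/train/',                # placeholder path
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(type='LoadAnnotations', with_bbox=True, with_mask=True),
            dict(type='Resize', img_scale=(512, 512), keep_ratio=True),  # smaller input size
            dict(type='RandomFlip', flip_ratio=0.5),
            dict(type='Normalize',
                 mean=[123.675, 116.28, 103.53],
                 std=[58.395, 57.12, 57.375],
                 to_rgb=True),
            dict(type='Pad', size_divisor=32),
            dict(type='DefaultFormatBundle'),
            dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels', 'gt_masks']),
        ]))
```

Halving the input resolution roughly quarters the activation memory of the backbone and FPN, which is usually the biggest lever on a 2 GB card.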