
Training on a pre-trained model: RuntimeError: CUDA error: out of memory

See original GitHub issue

🐛 Bug

I am launching training with a pre-trained model on a 2-class, COCO-like dataset.

To Reproduce

Steps to reproduce the behavior:

  1. Run training with this command line

python tools/train_net.py --config-file "configs/myconfig.yaml" SOLVER.IMS_PER_BATCH 10 SOLVER.BASE_LR 0.0025 SOLVER.MAX_ITER 720000 SOLVER.STEPS "(480000, 640000)" TEST.IMS_PER_BATCH 1

where myconfig.yaml points to mymodel.pth like this: WEIGHT: "/Users/karimimohammedbelhal/.torch/models/mymodel", and mymodel.pth is a pre-trained model with the right keys deleted as suggested in #15.
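
For reference, a minimal sketch of what trimming those keys can look like; this is not the exact script from #15, and the key names and file names below are illustrative assumptions:

# Hedged sketch: load the checkpoint, drop the class-dependent head weights,
# and save a trimmed copy so the 2-class heads get re-initialized at training time.
import torch

ckpt = torch.load("mymodel.pth", map_location="cpu")        # illustrative file name
state_dict = ckpt["model"] if "model" in ckpt else ckpt     # handle wrapped or bare state dicts

# Key substrings below are typical for detection/mask heads, but are assumptions here.
for key in [k for k in list(state_dict)
            if "cls_score" in k or "bbox_pred" in k or "mask_fcn_logits" in k]:
    del state_dict[key]

torch.save(ckpt, "mymodel_trimmed.pth")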

Expected behavior

Training should start and complete.

Environment

PyTorch version: 1.0.0.dev20181123
Is debug build: No
CUDA used to build PyTorch: 9.0.176

OS: Ubuntu 18.04.1 LTS
GCC version: (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0
CMake version: version 3.10.2

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 9.1.85
GPU models and configuration:
GPU 0: GeForce GTX 1080 Ti
GPU 1: GeForce GTX 1080 Ti

Nvidia driver version: 396.51
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_static_v7.a

Versions of relevant libraries:
[pip3] numpy (1.13.3)
[pip3] torch (0.4.1)
[pip3] torchvision (0.2.1)
[conda] pytorch-nightly 1.0.0.dev20181123 py3.7_cuda9.0.176_cudnn7.4.1_0 pytorch

Returned Error

Traceback (most recent call last):
  File "tools/train_net.py", line 170, in <module>
    main()
  File "tools/train_net.py", line 163, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 31, in train
    model.to(device)
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 381, in to
    return self._apply(convert)
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
    module._apply(fn)
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
    module._apply(fn)
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
    module._apply(fn)
  [Previous line repeated 1 more time]
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 193, in _apply
    param.data = fn(param.data)
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 379, in convert
    return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
RuntimeError: CUDA error: out of memory
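
The traceback shows the failure happening in model.to(device), i.e. while the weights are first being copied to the GPU, before any training step runs. A quick way to see what each GPU already reports (a hedged diagnostic sketch, not taken from the issue) is:

# Hedged diagnostic sketch: print total memory per GPU and what this process
# has allocated so far, using long-standing torch.cuda calls only.
import torch

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    used = torch.cuda.memory_allocated(i)   # bytes allocated by this process only
    print("GPU %d: %s, total %.1f GiB, allocated by this process %.2f GiB"
          % (i, props.name, props.total_memory / 1024**3, used / 1024**3))

If nvidia-smi already shows the 1080 Ti cards close to their ~11 GiB limit before training starts, some other process is holding the memory rather than the training run itself.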

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 12 (4 by maintainers)

Top GitHub Comments

3 reactions
zimenglan-sysu-512 commented, Dec 1, 2018

If you use a single GPU to train the model, you should make IMS_PER_BATCH small enough to fit in memory (e.g. IMS_PER_BATCH=2).
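
Applied to the command from the original report, that suggestion would look something like this (the batch size value is just the example from the comment):

python tools/train_net.py --config-file "configs/myconfig.yaml" SOLVER.IMS_PER_BATCH 2 SOLVER.BASE_LR 0.0025 SOLVER.MAX_ITER 720000 SOLVER.STEPS "(480000, 640000)" TEST.IMS_PER_BATCH 1

If you shrink the batch size this much, the learning rate and schedule may also need retuning, but that goes beyond what the comment itself says.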

0 reactions
lalalafloat commented, Oct 17, 2022

I solved this problem by rebooting the server.
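
Rebooting works because it clears whatever processes were still holding GPU memory; the lighter-weight equivalent, echoed by one of the web results below, is to look for stray processes with nvidia-smi and kill them (be careful: the second command kills every Python process on the machine):

nvidia-smi          # check which processes are holding GPU memory
pkill -9 python     # from the result below; kills ALL python processes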

Read more comments on GitHub >

Top Results From Across the Web

Resolving CUDA Being Out of Memory With Gradient ...
So when you try to execute the training, and you don't have enough free CUDA memory available, then the framework you're using throws...

Solving the “RuntimeError: CUDA Out of memory” error
Solving the “RuntimeError: CUDA Out of memory” error · Changing the batch size · Lower the precision · Doing what the error says in...

RuntimeError: CUDA out of memory with pre-trained model
It may sound dumb, but try to execute this command in the terminal: pkill -9 python. Be careful though, this command kills all python...

Cuda out of memory error - Intermediate - Hugging Face Forums
I encounter the below error when I finetune my dataset on mbart. RuntimeError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU...

RuntimeError: CUDA out of memory + gpu ... - PyTorch Forums
I am training a pre-trained inception V3 model for cifar10 data for epochs=10 ... of training colab throws error related to RuntimeError: CUDA...
