
RuntimeError: CUDA error: out of memory

See original GitHub issue

❓ Questions and Help

When training on my own dataset with a ResNet-101 backbone, after 27k iterations it always encounters this problem, as below:

File "maskrcnn-benchmark/maskrcnn_benchmark/engine/trainer.py", line 75, in do_train
    losses.backward()
  File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA error: out of memory

BTW, the input size is set to (800, 1333).
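
Not part of the original question, but a useful first step with this kind of intermittent OOM is to log GPU memory around the backward pass to see whether usage grows across iterations. A minimal PyTorch sketch (the log_gpu_memory helper is hypothetical, and it assumes a CUDA device is available):

import torch

def log_gpu_memory(tag):
    # torch.cuda.memory_allocated / max_memory_allocated report CUDA tensor
    # allocations (in bytes) on the current device.
    allocated = torch.cuda.memory_allocated() / 1024 ** 2
    peak = torch.cuda.max_memory_allocated() / 1024 ** 2
    print(f"[{tag}] allocated: {allocated:.0f} MiB, peak: {peak:.0f} MiB")

# Called around the failing line in trainer.py, e.g.:
#   log_gpu_memory("before backward")
#   losses.backward()
#   log_gpu_memory("after backward")

If the peak climbs steadily over iterations, something (such as growing ground-truth data) is inflating memory use, rather than the model simply being too large for the card.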

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 23 (19 by maintainers)

Top GitHub Comments

1 reaction
zimenglan-sysu-512 commented, Nov 19, 2018

Hi @fmassa,

The OOM problem has been solved: I had duplicated the ground truths several times, which brought the number of GT bboxes up to about 2k (very sorry for that). BTW, if you use the CPU to compute the IoUs between predictions and GT, it is not only these lines that need modifying; you also need to pay attention to a few other lines so that it can handle a large number of GT bboxes, at the cost of slower training (training time is maybe doubled).
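
For reference, a minimal sketch of that CPU fallback: compute the pairwise IoU matrix on the CPU, so that the large (N, M, 2) broadcast intermediates never live on the GPU, then move the final matrix back. This is a hypothetical boxlist_iou_cpu helper, not the exact maskrcnn-benchmark code:

import torch

def boxlist_iou_cpu(pred_boxes, gt_boxes):
    # pred_boxes: (N, 4), gt_boxes: (M, 4), both in (x1, y1, x2, y2) format.
    # Doing this on the CPU avoids allocating the (N, M, 2) broadcast
    # intermediates on the GPU, at the cost of slower training.
    device = pred_boxes.device
    a, b = pred_boxes.cpu(), gt_boxes.cpu()
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    lt = torch.max(a[:, None, :2], b[None, :, :2])  # intersection top-left
    rb = torch.min(a[:, None, 2:], b[None, :, 2:])  # intersection bottom-right
    wh = (rb - lt).clamp(min=0)                     # intersection width/height
    inter = wh[:, :, 0] * wh[:, :, 1]
    iou = inter / (area_a[:, None] + area_b[None, :] - inter)
    return iou.to(device)  # move the final (N, M) IoU matrix back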

About the hanging: since I upgraded Ubuntu 14.04 to 16.04 and installed CUDA 9.0 (or CUDA 9.2) with different NVIDIA drivers (390, 396, 410), it sometimes happens. As @chengyangfu said, with NVIDIA driver 410 the frequency is much lower.

Thanks!

1 reaction
zimenglan-sysu-512 commented, Nov 16, 2018

Thanks @fmassa. After updating Ubuntu from 14.04 to 16.04, I will try what you suggest and then report my results here. Thanks again.


Top Results From Across the Web

  • "RuntimeError: CUDA error: out of memory" - Stack Overflow
    The error occurs because you ran out of memory on your GPU. One way to solve it is to reduce the batch size...

  • Solving the "RuntimeError: CUDA Out of memory" error
    Reduce the batch_size · Lower the precision · Do what the error says · ...

  • Resolving CUDA Being Out of Memory With Gradient ...
    So when you try to execute the training, and you don't have enough free CUDA memory available, then the framework you're using throws...

  • Solving "CUDA out of memory" Error - Kaggle
    If you try to train multiple models on GPU, you are most likely to encounter some error similar to this one: RuntimeError: CUDA...

  • Help CUDA error: out of memory - PyTorch Forums
    The error is raised if you are running out of memory on your device, so you could try to reduce the memory requirement...
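
The recurring advice across these results is to shrink the per-step memory footprint: smaller batches, lower precision, or gradient accumulation. As a hedged illustration of gradient accumulation in generic PyTorch (not maskrcnn-benchmark specific, with toy stand-ins for the model, optimizer, and data), smaller micro-batches are run while the effective batch size is preserved:

import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(10, 1).to(device)   # toy stand-in for the real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()
data = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(16)]

accumulation_steps = 4  # effective batch = 8 * 4 = 32; peak memory ~ batch of 8

optimizer.zero_grad()
for step, (x, y) in enumerate(data):
    x, y = x.to(device), y.to(device)
    loss = loss_fn(model(x), y) / accumulation_steps
    loss.backward()                    # gradients accumulate in param.grad
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()               # one update per accumulation window
        optimizer.zero_grad()

Dividing the loss by accumulation_steps keeps the accumulated gradient equal to the gradient of the full-batch mean loss, so the optimizer sees the same scale as ordinary training.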
