
CUDA RuntimeError during losses.backward()

See original GitHub issue

❓ Questions and Help

Hi. Thank you for your great efforts.

I’m trying to use my own dataset, which has only a single class. Following the README, I implemented a dataset module (e.g. maskrcnn_benchmark/data/datasets/mydata.py), set MODEL.ROI_BOX_HEAD.NUM_CLASSES=2, and updated the other relevant files accordingly. (A detailed description is omitted, since the error can be reproduced more simply as described below.)
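(For context, a minimal single-class dataset following the convention described in the repository’s README might look roughly like the sketch below. The class name, paths, and annotation format are placeholders of my own, not code from this issue; only the (image, BoxList target, idx) return convention, the "labels" field, and get_img_info are taken from the README.)

# Minimal sketch of a custom single-class dataset (illustrative only).
import torch
from PIL import Image
from maskrcnn_benchmark.structures.bounding_box import BoxList

class MyDataset(object):
    def __init__(self, image_paths, annotations, transforms=None):
        self.image_paths = image_paths    # list of image file paths
        self.annotations = annotations    # list of {"boxes": [[x1, y1, x2, y2], ...]}
        self.transforms = transforms

    def __getitem__(self, idx):
        image = Image.open(self.image_paths[idx]).convert("RGB")
        boxes = torch.as_tensor(self.annotations[idx]["boxes"], dtype=torch.float32)

        target = BoxList(boxes, image.size, mode="xyxy")
        # single foreground class: every box gets label 1 (0 is reserved for background)
        target.add_field("labels", torch.ones(len(boxes), dtype=torch.int64))

        if self.transforms is not None:
            image, target = self.transforms(image, target)
        return image, target, idx

    def get_img_info(self, idx):
        width, height = Image.open(self.image_paths[idx]).size
        return {"height": height, "width": width}

    def __len__(self):
        return len(self.image_paths)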

Error message

Traceback (most recent call last):
  File "tools/train_net.py", line 174, in <module>
    main()
  File "tools/train_net.py", line 167, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 73, in train
    arguments,
  File "/maskrcnn-benchmark/maskrcnn_benchmark/engine/trainer.py", line 76, in do_train
    losses.backward()
  File "/miniconda/envs/py36/lib/python3.6/site-packages/torch/tensor.py", line 106, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/miniconda/envs/py36/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cublas runtime error : the GPU program failed to execute at /opt/conda/conda-bld/pytorch-nightly_1549566635986/work/aten/src/THC/THCBlas.cu:259

How to reproduce

  1. Build the Docker image and set up maskrcnn-benchmark. Due to #167, I commented out https://github.com/facebookresearch/maskrcnn-benchmark/blob/13b4f82efd953276b24ce01f0fd1cd08f94fbaf8/docker/Dockerfile#L51, then ran nvidia-docker build -t maskrcnn-benchmark docker/ and, inside the container, python setup.py build develop.

  2. Add the line target.extra_fields['labels'].clamp_(0, 1) just above https://github.com/facebookresearch/maskrcnn-benchmark/blob/13b4f82efd953276b24ce01f0fd1cd08f94fbaf8/maskrcnn_benchmark/data/datasets/coco.py#L91-L96. This collapses the 80 COCO classes into a single foreground class (see the short illustration after these steps).

  3. Place COCO dataset in /maskrcnn-benchmark/datasets/coco

  4. Run (single GPU)

python tools/train_net.py \
    --config-file "configs/e2e_faster_rcnn_R_50_FPN_1x.yaml" \
    MODEL.ROI_BOX_HEAD.NUM_CLASSES 2 \
    SOLVER.IMS_PER_BATCH 2
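
(To make step 2 concrete, here is a standalone illustration of what the in-place clamp does to a labels tensor; this is not code from coco.py. Every foreground category id is mapped to 1, so only background (0) and a single foreground class remain.)

import torch

labels = torch.tensor([1, 7, 23, 80])  # contiguous COCO category ids for one image
labels.clamp_(0, 1)                    # in-place clamp to the range [0, 1]
print(labels)                          # tensor([1, 1, 1, 1])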

I found that increasing NUM_CLASSES lets a few iterations run successfully before the error appears. What am I missing? Please help!
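
(A debugging step not mentioned in the thread, but often useful for CUDA errors raised inside backward(): rerunning the same command with synchronous kernel launches usually makes the traceback point at the kernel that actually failed rather than at losses.backward().)

CUDA_LAUNCH_BLOCKING=1 python tools/train_net.py \
    --config-file "configs/e2e_faster_rcnn_R_50_FPN_1x.yaml" \
    MODEL.ROI_BOX_HEAD.NUM_CLASSES 2 \
    SOLVER.IMS_PER_BATCH 2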

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Reactions: 1
  • Comments: 6 (5 by maintainers)

Top GitHub Comments

3 reactions
ClimbsRocks commented, Feb 20, 2019

For anyone else running into this issue, I was able to solve it by installing the latest pytorch (pytorch-nightly still seemed to give this error, even recent versions, but I may have just been screwing something up), and then reinstalling this library. I didn’t reinstall the library at first, and it caused some headaches.

In short,

conda uninstall pytorch-nightly
conda install pytorch -c pytorch
cd path/to/maskrcnn-benchmark
rm -rf build # Remove the previous build files
rm -rf maskrcnn_benchmark.egg-info # Remove metadata about the previous build
python setup.py build develop

Again, for me, it’s important to run that last step. Seems kinda obvious in hindsight, but not at the time 😃
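
(Not part of the original comment, but a quick way to confirm the swap took effect before re-running training is to check the interpreter:)

>>> import torch
>>> torch.__version__          # a stable release such as '1.0.1.post2', not a '.dev' nightly build
>>> torch.cuda.is_available()  # should print True on a working GPU setup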

2 reactions
limbee commented, Feb 15, 2019

Yes, it turned out to be a problem with pytorch-nightly. I was using it because the Docker image installs it.

The previous version of PyTorch was

>>> torch.__version__
'1.0.0.dev20190207'

After I replaced this line https://github.com/facebookresearch/maskrcnn-benchmark/blob/327bc29bcc4924e35bd61c59877d5a1d25bb75af/docker/Dockerfile#L35 with

RUN conda install -y pytorch -c pytorch \

the problem was resolved.

Now the PyTorch version is

>>> torch.__version__
'1.0.1.post2'

Thanks a lot!


Top Results From Across the Web

  • Strange error from loss.backward() - autograd - PyTorch Forums
  • CUDA error: CUBLAS_STATUS_ALLOC_FAILED when running loss.backward()
  • RuntimeError: CUDA out of memory - Deep Graph Library
  • CUDA Error: Device-Side Assert Triggered: Solved | Built In
  • Pytorch throws CUDA runtime error on WSL2
