CUDA RuntimeError during losses.backward()
❓ Questions and Help
Hi. Thank you for your great efforts.
I’m trying to use my own dataset, which has only a single class. I implemented a dataset class (e.g. maskrcnn_benchmark/data/datasets/mydata.py) following the README, set MODEL.ROI_BOX_HEAD.NUM_CLASSES=2, and updated the other relevant files accordingly. (I omit the detailed description, since the error can be reproduced more simply, as described below.)
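For context, the kind of dataset class meant here is sketched below. This is an illustration, not the actual mydata.py: MyDataset, its items argument, and the in-memory (image, boxes) pairs are all hypothetical; only the BoxList usage follows the repository's structures.

from torch.utils.data import Dataset
import torch
from maskrcnn_benchmark.structures.bounding_box import BoxList

class MyDataset(Dataset):
    # Hypothetical single-class dataset; `items` holds (PIL image, boxes) pairs.
    CLASSES = ("__background__", "object")  # NUM_CLASSES=2: background + 1 foreground class

    def __init__(self, items, transforms=None):
        self.items = items
        self.transforms = transforms

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        img, boxes = self.items[idx]  # boxes: list of [x1, y1, x2, y2]
        target = BoxList(torch.as_tensor(boxes, dtype=torch.float32),
                         img.size, mode="xyxy")
        # Every box gets label 1 (the single foreground class); 0 is background.
        target.add_field("labels", torch.ones(len(boxes), dtype=torch.int64))
        if self.transforms is not None:
            img, target = self.transforms(img, target)
        return img, target, idx

    def get_img_info(self, idx):
        img, _ = self.items[idx]
        return {"height": img.size[1], "width": img.size[0]}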
Error message
Traceback (most recent call last):
File "tools/train_net.py", line 174, in <module>
main()
File "tools/train_net.py", line 167, in main
model = train(cfg, args.local_rank, args.distributed)
File "tools/train_net.py", line 73, in train
arguments,
File "/maskrcnn-benchmark/maskrcnn_benchmark/engine/trainer.py", line 76, in do_train
losses.backward()
File "/miniconda/envs/py36/lib/python3.6/site-packages/torch/tensor.py", line 106, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/miniconda/envs/py36/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: cublas runtime error : the GPU program failed to execute at /opt/conda/conda-bld/pytorch-nightly_1549566635986/work/aten/src/THC/THCBlas.cu:259
How to reproduce
- Build the Docker image and set up maskrcnn-benchmark. Due to #167, I commented out https://github.com/facebookresearch/maskrcnn-benchmark/blob/13b4f82efd953276b24ce01f0fd1cd08f94fbaf8/docker/Dockerfile#L51 and ran
  nvidia-docker build -t maskrcnn-benchmark docker/
  Afterwards, I ran python setup.py build develop inside the container.
- Add this line
  target.extra_fields['labels'].clamp_(0, 1)
  above here: https://github.com/facebookresearch/maskrcnn-benchmark/blob/13b4f82efd953276b24ce01f0fd1cd08f94fbaf8/maskrcnn_benchmark/data/datasets/coco.py#L91-L96
  This collapses the 80 COCO classes into a single foreground class (see the short example after this list).
- Place the COCO dataset in /maskrcnn-benchmark/datasets/coco
- Run (single GPU):
python tools/train_net.py \
--config-file "configs/e2e_faster_rcnn_R_50_FPN_1x.yaml" \
MODEL.ROI_BOX_HEAD.NUM_CLASSES 2 \
SOLVER.IMS_PER_BATCH 2
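To illustrate what the clamp in the second step does: it maps every contiguous foreground class id down to 1 in place, so the dataset behaves as if it had a single class. A minimal standalone example:

import torch

labels = torch.tensor([3, 17, 80])  # contiguous COCO class ids (1..80)
labels.clamp_(0, 1)                 # in-place clamp: every foreground id becomes 1
print(labels)                       # tensor([1, 1, 1])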
I found that increasing NUM_CLASSES lets training run for a few iterations before it fails. What am I missing? Please help!
Top GitHub Comments
For anyone else running into this issue, I was able to solve it by installing the latest pytorch (pytorch-nightly still seemed to give this error, even recent versions, but I may have just been screwing something up), and then reinstalling this library. I didn’t reinstall the library at first, and it caused some headaches.
In short: install the latest stable PyTorch, then rebuild this library.
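Something along these lines (a sketch, assuming a conda environment; adjust the path to your checkout):

conda install -y pytorch -c pytorch   # latest stable release, not nightly
cd /path/to/maskrcnn-benchmark        # hypothetical checkout path
python setup.py build develop         # rebuild the library against the new PyTorch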
Again, for me, it was important to run that last step (the rebuild). Seems kinda obvious in hindsight, but not at the time 😃
Yes, it turned out to be a problem with pytorch-nightly, which I was using because the Dockerfile installs it. The previous version of PyTorch was:
After I replaced this line https://github.com/facebookresearch/maskrcnn-benchmark/blob/327bc29bcc4924e35bd61c59877d5a1d25bb75af/docker/Dockerfile#L35 with
RUN conda install -y pytorch -c pytorch \
the problem was resolved. Now the PyTorch version is:
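For anyone making the same Dockerfile change, the diff amounts to swapping the nightly package for the stable one, roughly as follows. The continuation line is an assumption; check the actual Dockerfile for the exact flags.

# Before (installs the nightly build) — paraphrase, not the exact original line:
RUN conda install -y pytorch-nightly -c pytorch \
 && conda clean -ya

# After (installs the stable release):
RUN conda install -y pytorch -c pytorch \
 && conda clean -ya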
Thanks a lot!