CUDA RuntimeError during losses.backward()
❓ Questions and Help
Hi. Thank you for your great efforts.
I’m trying to use my own dataset, which has only a single class. I implemented a dataset class (e.g. maskrcnn_benchmark/data/datasets/mydata.py) following the README, set MODEL.ROI_BOX_HEAD.NUM_CLASSES=2, and updated the other relevant files accordingly. (I omit the detailed description, since the error can be reproduced more simply, as described below.)
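For context, the kind of dataset class meant here is sketched below. This is an illustration, not the actual mydata.py: MyDataset, its items argument, and the in-memory (image, boxes) pairs are all hypothetical; only the BoxList usage follows the repository's structures.

from torch.utils.data import Dataset
import torch
from maskrcnn_benchmark.structures.bounding_box import BoxList

class MyDataset(Dataset):
    # Hypothetical single-class dataset; `items` holds (PIL image, boxes) pairs.
    CLASSES = ("__background__", "object")  # NUM_CLASSES=2: background + 1 foreground class

    def __init__(self, items, transforms=None):
        self.items = items
        self.transforms = transforms

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        img, boxes = self.items[idx]  # boxes: list of [x1, y1, x2, y2]
        target = BoxList(torch.as_tensor(boxes, dtype=torch.float32),
                         img.size, mode="xyxy")
        # Every box gets label 1 (the single foreground class); 0 is background.
        target.add_field("labels", torch.ones(len(boxes), dtype=torch.int64))
        if self.transforms is not None:
            img, target = self.transforms(img, target)
        return img, target, idx

    def get_img_info(self, idx):
        img, _ = self.items[idx]
        return {"height": img.size[1], "width": img.size[0]}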
Error message
Traceback (most recent call last):
File "tools/train_net.py", line 174, in <module>
main()
File "tools/train_net.py", line 167, in main
model = train(cfg, args.local_rank, args.distributed)
File "tools/train_net.py", line 73, in train
arguments,
File "/maskrcnn-benchmark/maskrcnn_benchmark/engine/trainer.py", line 76, in do_train
losses.backward()
File "/miniconda/envs/py36/lib/python3.6/site-packages/torch/tensor.py", line 106, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/miniconda/envs/py36/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: cublas runtime error : the GPU program failed to execute at /opt/conda/conda-bld/pytorch-nightly_1549566635986/work/aten/src/THC/THCBlas.cu:259
How to reproduce
- Build the Docker image and set up maskrcnn-benchmark. Due to #167, I commented out https://github.com/facebookresearch/maskrcnn-benchmark/blob/13b4f82efd953276b24ce01f0fd1cd08f94fbaf8/docker/Dockerfile#L51 and ran
  nvidia-docker build -t maskrcnn-benchmark docker/
  Afterwards, I ran python setup.py build develop inside the container.
- Add this line
  target.extra_fields['labels'].clamp_(0, 1)
  above here: https://github.com/facebookresearch/maskrcnn-benchmark/blob/13b4f82efd953276b24ce01f0fd1cd08f94fbaf8/maskrcnn_benchmark/data/datasets/coco.py#L91-L96
  This collapses the 80 COCO classes into a single foreground class (see the short example after this list).
- Place the COCO dataset in /maskrcnn-benchmark/datasets/coco
- Run (single GPU):
python tools/train_net.py \
--config-file "configs/e2e_faster_rcnn_R_50_FPN_1x.yaml" \
MODEL.ROI_BOX_HEAD.NUM_CLASSES 2 \
SOLVER.IMS_PER_BATCH 2
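To illustrate what the clamp in the second step does: it maps every contiguous foreground class id down to 1 in place, so the dataset behaves as if it had a single class. A minimal standalone example:

import torch

labels = torch.tensor([3, 17, 80])  # contiguous COCO class ids (1..80)
labels.clamp_(0, 1)                 # in-place clamp: every foreground id becomes 1
print(labels)                       # tensor([1, 1, 1])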
I found that increasing NUM_CLASSES lets training run for a few iterations before it fails. What am I missing? Please help!
Top GitHub Comments
For anyone else running into this issue, I was able to solve it by installing the latest pytorch (pytorch-nightly still seemed to give this error, even recent versions, but I may have just been screwing something up), and then reinstalling this library. I didn’t reinstall the library at first, and it caused some headaches.
In short: install the latest stable PyTorch, then rebuild this library.
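Something along these lines (a sketch, assuming a conda environment; adjust the path to your checkout):

conda install -y pytorch -c pytorch   # latest stable release, not nightly
cd /path/to/maskrcnn-benchmark        # hypothetical checkout path
python setup.py build develop         # rebuild the library against the new PyTorch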
Again, for me, it was important to run that last step (the rebuild). Seems kinda obvious in hindsight, but not at the time 😃
Yes, it turned out to be a problem with pytorch-nightly, which I was using because the Dockerfile installs it. The previous version of PyTorch was:
After I replaced this line https://github.com/facebookresearch/maskrcnn-benchmark/blob/327bc29bcc4924e35bd61c59877d5a1d25bb75af/docker/Dockerfile#L35 with
RUN conda install -y pytorch -c pytorch \
the problem was resolved. Now the PyTorch version is:
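For anyone making the same Dockerfile change, the diff amounts to swapping the nightly package for the stable one, roughly as follows. The continuation line is an assumption; check the actual Dockerfile for the exact flags.

# Before (installs the nightly build) — paraphrase, not the exact original line:
RUN conda install -y pytorch-nightly -c pytorch \
 && conda clean -ya

# After (installs the stable release):
RUN conda install -y pytorch -c pytorch \
 && conda clean -ya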
Thanks a lot!