Error happened on fp16 cascade rcnn

Describe the bug

Training Faster R-CNN with fp16 works normally for me. When I add 'fp16 = dict(loss_scale=512.)' to the config cascade_rcnn_r101_fpn_1x.py, training fails with the error below.

Error info:

/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: lambda [](int)->auto::operator()(int)->auto: block: [12,0,0], thread: [0,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
...
...
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: lambda [](int)->auto::operator()(int)->auto: block: [5,0,0], thread: [61,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: lambda [](int)->auto::operator()(int)->auto: block: [5,0,0], thread: [62,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: lambda [](int)->auto::operator()(int)->auto: block: [5,0,0], thread: [63,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
Traceback (most recent call last):
  File "tools/train.py", line 124, in <module>
    main()
  File "tools/train.py", line 120, in main
    timestamp=timestamp)
  File "/media/zpf/project/mmdetection/mmdetection/mmdet/apis/train.py", line 133, in train_detector
    timestamp=timestamp)
  File "/media/zpf/project/mmdetection/mmdetection/mmdet/apis/train.py", line 319, in _non_dist_train
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/home/zpf/anaconda3/envs/mmdetection/lib/python3.7/site-packages/mmcv/runner/runner.py", line 363, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/zpf/anaconda3/envs/mmdetection/lib/python3.7/site-packages/mmcv/runner/runner.py", line 267, in train
    self.model, data_batch, train_mode=True, **kwargs)
  File "/media/zpf/project/mmdetection/mmdetection/mmdet/apis/train.py", line 100, in batch_processor
    losses = model(**data)
  File "/home/zpf/anaconda3/envs/mmdetection/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zpf/anaconda3/envs/mmdetection/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/zpf/anaconda3/envs/mmdetection/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/zpf/project/mmdetection/mmdetection/mmdet/core/fp16/decorators.py", line 75, in new_func
    output = old_func(*new_args, **new_kwargs)
  File "/media/zpf/project/mmdetection/mmdetection/mmdet/models/detectors/base.py", line 138, in forward
    return self.forward_train(img, img_meta, **kwargs)
  File "/media/zpf/project/mmdetection/mmdetection/mmdet/models/detectors/cascade_rcnn.py", line 203, in forward_train
    proposal_list = self.rpn_head.get_bboxes(*proposal_inputs)
  File "/media/zpf/project/mmdetection/mmdetection/mmdet/core/fp16/decorators.py", line 152, in new_func
    output = old_func(*new_args, **new_kwargs)
  File "/media/zpf/project/mmdetection/mmdetection/mmdet/models/anchor_heads/anchor_head.py", line 276, in get_bboxes
    scale_factor, cfg, rescale)
  File "/media/zpf/project/mmdetection/mmdetection/mmdet/models/anchor_heads/rpn_head.py", line 83, in get_bboxes_single
    self.target_stds, img_shape)
  File "/media/zpf/project/mmdetection/mmdetection/mmdet/core/bbox/transforms.py", line 78, in delta2bbox
    means = deltas.new_tensor(means).repeat(1, deltas.size(1) // 4)
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: device-side assert triggered (insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:569)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7fa86a216813 in /home/zpf/anaconda3/envs/mmdetection/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x16126 (0x7fa86a451126 in /home/zpf/anaconda3/envs/mmdetection/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x16b11 (0x7fa86a451b11 in /home/zpf/anaconda3/envs/mmdetection/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x4d (0x7fa86a206f0d in /home/zpf/anaconda3/envs/mmdetection/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x4b1bb2 (0x7fa86ab1abb2 in /home/zpf/anaconda3/envs/mmdetection/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x4b1bf6 (0x7fa86ab1abf6 in /home/zpf/anaconda3/envs/mmdetection/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #25: __libc_start_main + 0xe7 (0x7fa86f3dcb97 in /lib/x86_64-linux-gnu/libc.so.6)

Environment

Ubuntu 18.04, PyTorch 1.3.0, CUDA 10.1, GCC 5.5.0, RTX 2080 Ti

Thank you for your reply
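A note on reading the traceback above: CUDA kernels are launched asynchronously, so a device-side assert is usually reported at a later synchronization point rather than at the operation that caused it. The Python frame shown (the `new_tensor(...).repeat(...)` line in `delta2bbox`) does no indexing itself, so the out-of-bounds index was probably produced earlier. A minimal sketch of one common way to narrow this down, assuming the training script can be edited, is to set the `CUDA_LAUNCH_BLOCKING` environment variable before any CUDA work starts. The helper name below is hypothetical and not part of mmdetection.

```python
import os


def enable_blocking_cuda_launches(env=os.environ):
    """Ask the CUDA runtime to run kernels synchronously.

    Hypothetical helper for illustration. It must run before the first
    CUDA call; the safest place is before `import torch`.
    """
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


enable_blocking_cuda_launches()
# import torch  # import only after the variable has been set
```

The same effect can be had without code changes by prefixing the launch command with `CUDA_LAUNCH_BLOCKING=1`. Training runs slower in this mode, but the Python traceback should then point at the kernel that actually failed.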

Top GitHub Comments

Dragonsson commented, Jan 10, 2020 (2 reactions)

I use FocalLoss and ran into the same error. When I switch from FocalLoss to CrossEntropyLoss, the error disappears. It also disappears when I remove the fp16 setting. However, I don't know what causes this.
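The thread does not establish a cause. One thing that may be worth ruling out, given that the error depends on both the loss function and the fp16 setting, is numeric overflow: half precision cannot represent finite values above 65504, and a static loss scale multiplies the loss (and therefore the gradients) by a constant before the backward pass. The sketch below is only back-of-the-envelope arithmetic for the `loss_scale=512.` value from the report. The function names are hypothetical, and nothing here confirms that overflow is what triggered the assert.

```python
FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def max_unscaled_magnitude(loss_scale):
    """Largest magnitude that stays finite in fp16 after scaling."""
    return FP16_MAX / loss_scale


def overflows_after_scaling(value, loss_scale):
    """True if value * loss_scale exceeds the fp16 finite range.

    Ignores rounding behaviour right at the boundary.
    """
    return abs(value) * loss_scale > FP16_MAX


print(max_unscaled_magnitude(512.0))          # 127.9375
print(overflows_after_scaling(200.0, 512.0))  # True
print(overflows_after_scaling(100.0, 512.0))  # False
```

If overflow is suspected, trying a smaller loss scale or checking the loss and the proposals for non-finite values before the failing iteration are cheap experiments.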

sysuwsq commented, Apr 27, 2021 (0 reactions)

So, how can this issue be solved?
