Error happened on fp16 cascade rcnn

Describe the bug: I can train Faster RCNN with fp16 without problems. When I add 'fp16 = dict(loss_scale=512.)' to the config cascade_rcnn_r101_fpn_1x.py, the error below occurs. Error info:

/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: lambda [](int)->auto::operator()(int)->auto: block: [12,0,0], thread: [0,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
...
...
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: lambda [](int)->auto::operator()(int)->auto: block: [5,0,0], thread: [61,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: lambda [](int)->auto::operator()(int)->auto: block: [5,0,0], thread: [62,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: lambda [](int)->auto::operator()(int)->auto: block: [5,0,0], thread: [63,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
Traceback (most recent call last):
  File "tools/train.py", line 124, in <module>
    main()
  File "tools/train.py", line 120, in main
    timestamp=timestamp)
  File "/media/zpf/project/mmdetection/mmdetection/mmdet/apis/train.py", line 133, in train_detector
    timestamp=timestamp)
  File "/media/zpf/project/mmdetection/mmdetection/mmdet/apis/train.py", line 319, in _non_dist_train
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/home/zpf/anaconda3/envs/mmdetection/lib/python3.7/site-packages/mmcv/runner/runner.py", line 363, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/zpf/anaconda3/envs/mmdetection/lib/python3.7/site-packages/mmcv/runner/runner.py", line 267, in train
    self.model, data_batch, train_mode=True, **kwargs)
  File "/media/zpf/project/mmdetection/mmdetection/mmdet/apis/train.py", line 100, in batch_processor
    losses = model(**data)
  File "/home/zpf/anaconda3/envs/mmdetection/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zpf/anaconda3/envs/mmdetection/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/zpf/anaconda3/envs/mmdetection/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/zpf/project/mmdetection/mmdetection/mmdet/core/fp16/decorators.py", line 75, in new_func
    output = old_func(*new_args, **new_kwargs)
  File "/media/zpf/project/mmdetection/mmdetection/mmdet/models/detectors/base.py", line 138, in forward
    return self.forward_train(img, img_meta, **kwargs)
  File "/media/zpf/project/mmdetection/mmdetection/mmdet/models/detectors/cascade_rcnn.py", line 203, in forward_train
    proposal_list = self.rpn_head.get_bboxes(*proposal_inputs)
  File "/media/zpf/project/mmdetection/mmdetection/mmdet/core/fp16/decorators.py", line 152, in new_func
    output = old_func(*new_args, **new_kwargs)
  File "/media/zpf/project/mmdetection/mmdetection/mmdet/models/anchor_heads/anchor_head.py", line 276, in get_bboxes
    scale_factor, cfg, rescale)
  File "/media/zpf/project/mmdetection/mmdetection/mmdet/models/anchor_heads/rpn_head.py", line 83, in get_bboxes_single
    self.target_stds, img_shape)
  File "/media/zpf/project/mmdetection/mmdetection/mmdet/core/bbox/transforms.py", line 78, in delta2bbox
    means = deltas.new_tensor(means).repeat(1, deltas.size(1) // 4)
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: device-side assert triggered (insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:569)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7fa86a216813 in /home/zpf/anaconda3/envs/mmdetection/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x16126 (0x7fa86a451126 in /home/zpf/anaconda3/envs/mmdetection/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x16b11 (0x7fa86a451b11 in /home/zpf/anaconda3/envs/mmdetection/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x4d (0x7fa86a206f0d in /home/zpf/anaconda3/envs/mmdetection/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x4b1bb2 (0x7fa86ab1abb2 in /home/zpf/anaconda3/envs/mmdetection/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x4b1bf6 (0x7fa86ab1abf6 in /home/zpf/anaconda3/envs/mmdetection/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #25: __libc_start_main + 0xe7 (0x7fa86f3dcb97 in /lib/x86_64-linux-gnu/libc.so.6)

Environment: Ubuntu 18.04, PyTorch 1.3.0, CUDA 10.1, GCC 5.5.0, RTX 2080 Ti

Thank you for your reply
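
The assertion repeated in the log checks that every index lies in the range [-size, size) for the dimension being indexed. Because CUDA kernels run asynchronously, the Python frame shown in the traceback (delta2bbox) is not necessarily where the bad index was produced. Below is a minimal sketch of two generic debugging aids, neither of which comes from the original thread: setting CUDA_LAUNCH_BLOCKING=1 before CUDA is initialized so the error is reported at the offending call, and a hypothetical pure-Python helper, out_of_bounds, that mirrors the assertion for index values copied to the host.

```python
import os

# Assumption: this runs before anything initializes CUDA, e.g. at the very
# top of tools/train.py. It makes kernel launches synchronous, so the Python
# traceback points at the call that actually failed.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def out_of_bounds(indices, size):
    """Return the indices that violate -size <= index < size.

    Hypothetical helper (not part of mmdetection or PyTorch); it mirrors the
    condition in the IndexKernel.cu assertion shown in the log.
    """
    return [i for i in indices if not (-size <= i < size)]
```

For example, out_of_bounds([0, 3, -1, 4], 4) returns [4], since 4 is not a valid index into a dimension of size 4.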

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 5 (1 by maintainers)

Top GitHub Comments

2 reactions
Dragonsson commented, Jan 10, 2020

I use FocalLoss and hit the same error. When I change FocalLoss to CrossEntropyLoss, the error disappears. The error also disappears when I remove the fp16 setting. However, I don't know what causes this.
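
The workaround described in this comment can be sketched as a config fragment. This is only an illustration: the comment does not say which head or config file was changed, so the variable names and the loss parameters below are assumptions modelled on typical MMDetection configs, not the commenter's actual settings.

```python
# Mixed-precision setting quoted in the original report.
fp16 = dict(loss_scale=512.)

# Assumed classification loss that reportedly triggers the assert under fp16.
# The parameter values are illustrative defaults, not taken from the thread.
loss_cls_focal = dict(
    type='FocalLoss',
    use_sigmoid=True,
    gamma=2.0,
    alpha=0.25,
    loss_weight=1.0)

# Assumed replacement after which the error reportedly disappeared.
loss_cls_ce = dict(
    type='CrossEntropyLoss',
    use_sigmoid=False,
    loss_weight=1.0)
```

Neither change explains the root cause; the comment only reports that the error went away with either one.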

0 reactions
sysuwsq commented, Apr 27, 2021

So, how can this issue be solved?


