Error happened on fp16 cascade rcnn
Describe the bug
fp16 training works normally for me with Faster R-CNN. When I add 'fp16 = dict(loss_scale=512.)' to the config cascade_rcnn_r101_fpn_1x.py, the error below occurs.
Error info:
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: lambda [](int)->auto::operator()(int)->auto: block: [12,0,0], thread: [0,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
...
...
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: lambda [](int)->auto::operator()(int)->auto: block: [5,0,0], thread: [61,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: lambda [](int)->auto::operator()(int)->auto: block: [5,0,0], thread: [62,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: lambda [](int)->auto::operator()(int)->auto: block: [5,0,0], thread: [63,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
Traceback (most recent call last):
File "tools/train.py", line 124, in <module>
main()
File "tools/train.py", line 120, in main
timestamp=timestamp)
File "/media/zpf/project/mmdetection/mmdetection/mmdet/apis/train.py", line 133, in train_detector
timestamp=timestamp)
File "/media/zpf/project/mmdetection/mmdetection/mmdet/apis/train.py", line 319, in _non_dist_train
runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
File "/home/zpf/anaconda3/envs/mmdetection/lib/python3.7/site-packages/mmcv/runner/runner.py", line 363, in run
epoch_runner(data_loaders[i], **kwargs)
File "/home/zpf/anaconda3/envs/mmdetection/lib/python3.7/site-packages/mmcv/runner/runner.py", line 267, in train
self.model, data_batch, train_mode=True, **kwargs)
File "/media/zpf/project/mmdetection/mmdetection/mmdet/apis/train.py", line 100, in batch_processor
losses = model(**data)
File "/home/zpf/anaconda3/envs/mmdetection/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/home/zpf/anaconda3/envs/mmdetection/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
return self.module(*inputs[0], **kwargs[0])
File "/home/zpf/anaconda3/envs/mmdetection/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/media/zpf/project/mmdetection/mmdetection/mmdet/core/fp16/decorators.py", line 75, in new_func
output = old_func(*new_args, **new_kwargs)
File "/media/zpf/project/mmdetection/mmdetection/mmdet/models/detectors/base.py", line 138, in forward
return self.forward_train(img, img_meta, **kwargs)
File "/media/zpf/project/mmdetection/mmdetection/mmdet/models/detectors/cascade_rcnn.py", line 203, in forward_train
proposal_list = self.rpn_head.get_bboxes(*proposal_inputs)
File "/media/zpf/project/mmdetection/mmdetection/mmdet/core/fp16/decorators.py", line 152, in new_func
output = old_func(*new_args, **new_kwargs)
File "/media/zpf/project/mmdetection/mmdetection/mmdet/models/anchor_heads/anchor_head.py", line 276, in get_bboxes
scale_factor, cfg, rescale)
File "/media/zpf/project/mmdetection/mmdetection/mmdet/models/anchor_heads/rpn_head.py", line 83, in get_bboxes_single
self.target_stds, img_shape)
File "/media/zpf/project/mmdetection/mmdetection/mmdet/core/bbox/transforms.py", line 78, in delta2bbox
means = deltas.new_tensor(means).repeat(1, deltas.size(1) // 4)
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered (insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:569)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7fa86a216813 in /home/zpf/anaconda3/envs/mmdetection/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x16126 (0x7fa86a451126 in /home/zpf/anaconda3/envs/mmdetection/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x16b11 (0x7fa86a451b11 in /home/zpf/anaconda3/envs/mmdetection/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x4d (0x7fa86a206f0d in /home/zpf/anaconda3/envs/mmdetection/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x4b1bb2 (0x7fa86ab1abb2 in /home/zpf/anaconda3/envs/mmdetection/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x4b1bf6 (0x7fa86ab1abf6 in /home/zpf/anaconda3/envs/mmdetection/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #25: __libc_start_main + 0xe7 (0x7fa86f3dcb97 in /lib/x86_64-linux-gnu/libc.so.6)
Environment: Ubuntu 18.04, PyTorch 1.3.0, CUDA 10.1, GCC 5.5.0, RTX 2080 Ti
Thank you for your reply
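A note on reading the traceback above: CUDA kernels are launched asynchronously, so a device-side assert is often reported at a later Python line than the one that triggered it. The delta2bbox frame may therefore not be where the out-of-bounds index was produced. A minimal sketch for getting a synchronous traceback, assuming the environment variable is set before any CUDA work starts (placing it in tools/train.py is only a suggestion):

```python
import os

# Must be set before the first CUDA call, so place it at the very top of the
# entry script (for example tools/train.py) or export it in the shell instead.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```

With this set, each kernel launch blocks until it finishes, so the Python frame in the traceback should point at the indexing operation that failed. Training runs slower in this mode, so it is meant for debugging only.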
Issue Analytics
- Created 4 years ago
- Comments:5 (1 by maintainers)
Top GitHub Comments
I use FocalLoss and hit the same error. When I change FocalLoss to CrossEntropyLoss, the error disappears. The error also disappears when I remove the fp16 setting. However, I don't know what causes this.
So, how can this issue be solved?
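Neither the cause nor a fix is confirmed in this thread. One thing that may be worth ruling out is fp16 overflow: the largest finite half-precision value is 65504, so with a static loss_scale of 512 any quantity above roughly 128 that passes through fp16 after scaling becomes inf. The sketch below only illustrates that arithmetic with the Python standard library. The helper name fits_in_fp16 is hypothetical and is not part of mmdetection or PyTorch.

```python
import math
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def fits_in_fp16(value: float) -> bool:
    """Return True if value can be stored as a finite half-precision float."""
    if not math.isfinite(value):
        return False
    try:
        # The "e" format packs a half-precision float and raises
        # OverflowError when the value is too large to represent.
        struct.pack("<e", value)
        return True
    except OverflowError:
        return False


loss_scale = 512.0  # value taken from the config in this issue
print(fits_in_fp16(100.0 * loss_scale))  # 51200.0 is below FP16_MAX
print(fits_in_fp16(128.0 * loss_scale))  # 65536.0 is above FP16_MAX
```

If overflow does turn out to be involved, trying a smaller loss_scale would be one experiment to run. Whether that resolves this particular assert is an assumption, not something the thread establishes.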