Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

[fp16 training error] CUDA error: device-side assert triggered

See original GitHub issue

Checklist

  • [x] I have searched related issues but could not get the expected help.
  • [x] The bug has not been fixed in the latest version.

Describe the bug: see the error description below and the full log at the end.

Error traceback

  1. What command or script did you run?
I ran the following command to train mask_rcnn_r50_fpn_fp16:
==============================================
NUM_GPUS=4
CONFIG=mmdetection/configs/fp16/mask_rcnn_r50_fpn_fp16_1x.py
WORK_DIR=work_dirs/mask_rcnn_r50_fpn_fp16_1x

tools/dist_train.sh $CONFIG $NUM_GPUS --validate --work_dir $WORK_DIR
==============================================
  2. If applicable, paste the error traceback here using code blocks.
Because it is too long, I will paste it at the end.

Reproduction details

  1. Did you make any modifications to the code? Did you understand what you modified? No

  2. What dataset did you use? COCO

Environment

  • OS: Ubuntu 16.04.4
  • GCC: 5.4.0
  • PyTorch version: 1.1.0
  • How you installed PyTorch: conda (inside docker)
  • GPU model: V100 32GB (NVLink)
  • CUDA and cuDNN version: CUDA 9.0, cuDNN 7

When I try to train the fp16 model, I get: CUDA error: device-side assert triggered (insert_events at …/c10/cuda/CUDACachingAllocator.cpp:564)

and many repetitions of the following message:
/tmp/pip-req-build-fl_vaj2n/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [1,0,0], thread: [80,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
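Because CUDA launches kernels asynchronously, the Python traceback produced by a device-side assert does not necessarily point at the kernel that actually failed. A minimal sketch, assuming a standard PyTorch setup, of forcing synchronous launches so the assert surfaces at the real call site:

    # Device-side asserts are reported asynchronously, so the Python
    # traceback can point at an unrelated later call. Forcing synchronous
    # launches makes the traceback point at the real faulting kernel.
    # The variable must be set before the first CUDA call.
    import os
    os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

    import torch  # import torch only after setting the variable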

When I comment out the fp16 configuration, the error does not occur: https://github.com/open-mmlab/mmdetection/blob/master/configs/fp16/mask_rcnn_r50_fpn_fp16_1x.py#L2
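For context, the fp16 configuration referenced by that link is a single line at the top of the config file; the exact loss_scale value below is assumed from the mmdetection configs of that era and may differ in other checkouts:

    # Line 2 of configs/fp16/mask_rcnn_r50_fpn_fp16_1x.py: this one dict
    # enables mixed-precision training; commenting it out falls back to fp32.
    # (loss_scale value assumed, not confirmed against this exact revision.)
    fp16 = dict(loss_scale=512.)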

Error message:
/home/user/Desktop/workspace_zacurr/mmdetection/work_dirs/mask_rcnn_r50_fpn_fp16_1x Directory exists
loading annotations into memory…
2019-07-02 00:56:29,056 - INFO - Distributed training: True
2019-07-02 00:56:29,549 - INFO - load model from: modelzoo://resnet50
loading annotations into memory…
2019-07-02 00:56:29,828 - WARNING - unexpected key in source state_dict: fc.weight, fc.bias

missing keys in source state_dict: layer3.0.bn1.num_batches_tracked, layer2.0.bn3.num_batches_tracked, layer2.2.bn1.num_batches_tracked, layer2.1.bn2.num_batches_tracked, layer1.0.bn2.num_batches_tracked, layer3.0.downsample.1.num_batches_tracked, layer4.1.bn2.num_batches_tracked, layer3.5.bn2.num_batches_tracked, layer3.1.bn2.num_batches_tracked, layer3.4.bn1.num_batches_tracked, layer1.2.bn2.num_batches_tracked, layer3.2.bn2.num_batches_tracked, layer3.1.bn3.num_batches_tracked, layer4.2.bn2.num_batches_tracked, layer2.0.bn2.num_batches_tracked, layer2.3.bn1.num_batches_tracked, layer4.2.bn3.num_batches_tracked, layer3.4.bn3.num_batches_tracked, layer3.2.bn3.num_batches_tracked, layer1.0.downsample.1.num_batches_tracked, layer2.1.bn1.num_batches_tracked, layer3.3.bn3.num_batches_tracked, layer4.0.downsample.1.num_batches_tracked, layer4.0.bn1.num_batches_tracked, layer4.0.bn3.num_batches_tracked, layer1.1.bn2.num_batches_tracked, layer3.0.bn3.num_batches_tracked, layer3.2.bn1.num_batches_tracked, layer3.0.bn2.num_batches_tracked, layer4.0.bn2.num_batches_tracked, layer2.2.bn2.num_batches_tracked, layer3.5.bn3.num_batches_tracked, layer1.0.bn1.num_batches_tracked, layer2.3.bn3.num_batches_tracked, layer1.0.bn3.num_batches_tracked, layer3.3.bn2.num_batches_tracked, layer4.1.bn1.num_batches_tracked, layer1.1.bn3.num_batches_tracked, layer2.3.bn2.num_batches_tracked, layer3.3.bn1.num_batches_tracked, layer3.1.bn1.num_batches_tracked, layer3.5.bn1.num_batches_tracked, layer2.0.downsample.1.num_batches_tracked, layer1.1.bn1.num_batches_tracked, layer3.4.bn2.num_batches_tracked, bn1.num_batches_tracked, layer1.2.bn1.num_batches_tracked, layer4.2.bn1.num_batches_tracked, layer2.0.bn1.num_batches_tracked, layer4.1.bn3.num_batches_tracked, layer2.1.bn3.num_batches_tracked, layer1.2.bn3.num_batches_tracked, layer2.2.bn3.num_batches_tracked

loading annotations into memory… loading annotations into memory… Done (t=12.76s) creating index… Done (t=12.47s) creating index… index created! Done (t=12.82s) creating index… index created! index created! Done (t=13.82s) creating index… index created! loading annotations into memory… loading annotations into memory… loading annotations into memory… loading annotations into memory… Done (t=1.77s) creating index… index created! Done (t=2.36s) creating index… Done (t=2.39s) creating index… index created! index created! Done (t=2.53s) creating index… index created!
2019-07-02 00:56:53,981 - INFO - Start running, host: root@b6940c72ef4f, work_dir: /home/user/Desktop/workspace_zacurr/mmdetection/work_dirs/mask_rcnn_r50_fpn_fp16_1x
2019-07-02 00:56:53,981 - INFO - workflow: [('train', 1)], max: 12 epochs
/tmp/pip-req-build-fl_vaj2n/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [5,0,0], thread: [96,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/tmp/pip-req-build-fl_vaj2n/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [5,0,0], thread: [97,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/tmp/pip-req-build-fl_vaj2n/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [5,0,0], thread: [98,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
… omitted …
/tmp/pip-req-build-fl_vaj2n/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [1,0,0], thread: [126,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/tmp/pip-req-build-fl_vaj2n/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [1,0,0], thread: [127,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
Traceback (most recent call last):
  File "/home/user/Desktop/workspace_zacurr/mmdetection/tools/train.py", line 98, in <module>
    main()
  File "/home/user/Desktop/workspace_zacurr/mmdetection/tools/train.py", line 94, in main
    logger=logger)
  File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/apis/train.py", line 60, in train_detector
    _dist_train(model, dataset, cfg, validate=validate)
  File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/apis/train.py", line 189, in _dist_train
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/opt/conda/lib/python3.6/site-packages/mmcv/runner/runner.py", line 356, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/mmcv/runner/runner.py", line 262, in train
    self.model, data_batch, train_mode=True, **kwargs)
  File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/apis/train.py", line 40, in batch_processor
    losses = model(**data)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 494, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/mmcv/parallel/distributed.py", line 50, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 494, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/core/fp16/decorators.py", line 75, in new_func
    output = old_func(*new_args, **new_kwargs)
  File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/models/detectors/base.py", line 86, in forward
    return self.forward_train(img, img_meta, **kwargs)
  File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/models/detectors/two_stage.py", line 114, in forward_train
    proposal_list = self.rpn_head.get_bboxes(*proposal_inputs)
  File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/core/fp16/decorators.py", line 152, in new_func
    output = old_func(*new_args, **new_kwargs)
  File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/models/anchor_heads/anchor_head.py", line 221, in get_bboxes
    scale_factor, cfg, rescale)
  File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/models/anchor_heads/rpn_head.py", line 83, in get_bboxes_single
    self.target_stds, img_shape)
  File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/core/bbox/transforms.py", line 40, in delta2bbox
    means = deltas.new_tensor(means).repeat(1, deltas.size(1) // 4)
RuntimeError: CUDA error: device-side assert triggered

[identical tracebacks from the other three worker processes omitted]

terminate called after throwing an instance of 'c10::Error'
  what(): CUDA error: device-side assert triggered (insert_events at …/c10/cuda/CUDACachingAllocator.cpp:564)
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x6a (0x7fe9d572d66a in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x140e0 (0x7fe9cf61b0e0 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x61 (0x7fe9d571b661 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #3: torch::autograd::Variable::Impl::release_resources() + 0x5e (0x7fe9d4d160ae in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #4: <unknown function> + 0x1333fb (0x7fe9ed5f13fb in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x352ae4 (0x7fe9ed810ae4 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x352b41 (0x7fe9ed810b41 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x19dbbc (0x5575e53ecbbc in /opt/conda/bin/python)
frame #8: <unknown function> + 0xf32a8 (0x5575e53422a8 in /opt/conda/bin/python)
frame #9: <unknown function> + 0xf343a (0x5575e534243a in /opt/conda/bin/python)
frame #10: <unknown function> + 0xf2c77 (0x5575e5341c77 in /opt/conda/bin/python)
frame #11: <unknown function> + 0xf2b07 (0x5575e5341b07 in /opt/conda/bin/python)
frame #12: <unknown function> + 0xf2b1d (0x5575e5341b1d in /opt/conda/bin/python)
frame #13: <unknown function> + 0xf2b1d (0x5575e5341b1d in /opt/conda/bin/python)
frame #14: <unknown function> + 0xf2b1d (0x5575e5341b1d in /opt/conda/bin/python)
frame #15: <unknown function> + 0xf2b1d (0x5575e5341b1d in /opt/conda/bin/python)
frame #16: <unknown function> + 0xf2b1d (0x5575e5341b1d in /opt/conda/bin/python)
frame #17: <unknown function> + 0xf2b1d (0x5575e5341b1d in /opt/conda/bin/python)
frame #18: <unknown function> + 0xf2b1d (0x5575e5341b1d in /opt/conda/bin/python)
frame #19: <unknown function> + 0xf2b1d (0x5575e5341b1d in /opt/conda/bin/python)
frame #20: PyDict_SetItem + 0x3da (0x5575e5387d4a in /opt/conda/bin/python)
frame #21: PyDict_SetItemString + 0x4f (0x5575e539084f in /opt/conda/bin/python)
frame #22: PyImport_Cleanup + 0x99 (0x5575e53f6b79 in /opt/conda/bin/python)
frame #23: Py_FinalizeEx + 0x61 (0x5575e5461961 in /opt/conda/bin/python)
frame #24: Py_Main + 0x355 (0x5575e546beb5 in /opt/conda/bin/python)
frame #25: main + 0xee (0x5575e5333b4e in /opt/conda/bin/python)
frame #26: __libc_start_main + 0xf0 (0x7fea04b54830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #27: <unknown function> + 0x1c61a8 (0x5575e54151a8 in /opt/conda/bin/python)

[the same c10::Error abort, differing only in addresses, is printed by the other three worker processes; omitted]

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 235, in <module>
    main()
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 231, in main
    cmd=process.args)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', '/home/user/Desktop/workspace_zacurr/mmdetection/tools/train.py', '--local_rank=0', '/home/user/Desktop/workspace_zacurr/mmdetection/configs/fp16/mask_rcnn_r50_fpn_fp16_1x.py', '--launcher', 'pytorch', '--validate', '--work_dir', '/home/user/Desktop/workspace_zacurr/mmdetection/work_dirs/mask_rcnn_r50_fpn_fp16_1x']' died with <Signals.SIGABRT: 6>.

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 23

Top GitHub Comments

2 reactions
BlakeXiaochu commented, Aug 1, 2019

@gittigxuy sorry for the late response! I have just solved the problem: it was caused by a mismatch among the numbers of gt_bboxes, gt_labels and gt_masks. When applying the crop operation I filtered out some bboxes that fell outside the cropping range, but forgot to filter the gt_labels and gt_masks accordingly. So I guess your problem has the same cause?
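A minimal sketch of the consistency check implied here, using hypothetical names for the per-image annotation arrays; it catches the mismatch on the CPU before it turns into an opaque device-side assert:

    def check_annotations(gt_bboxes, gt_labels, gt_masks):
        # Fail fast if an augmentation dropped boxes without also
        # dropping the matching labels/masks.
        assert len(gt_bboxes) == len(gt_labels) == len(gt_masks), (
            'annotation mismatch after augmentation: '
            f'{len(gt_bboxes)} bboxes, {len(gt_labels)} labels, '
            f'{len(gt_masks)} masks')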

1 reaction
yhcao6 commented, Aug 2, 2019

I have sent the code to your gmail, waiting for your reply. Thanks

import numpy as np

def bbox_area(bbox):
    # Helper assumed by the original snippet (its definition was not shown
    # in the issue): area of [x_min, y_min, x_max, y_max] rows.
    return (bbox[:, 2] - bbox[:, 0]) * (bbox[:, 3] - bbox[:, 1])

def clip_box(bbox, clip_region, alpha):
    # NOTE: the second parameter was originally also named `clip_box`,
    # shadowing the function itself; renamed here for clarity.
    ar_ = bbox_area(bbox)
    x_min = np.maximum(bbox[:, 0], clip_region[0]).reshape(-1, 1)
    y_min = np.maximum(bbox[:, 1], clip_region[1]).reshape(-1, 1)
    x_max = np.minimum(bbox[:, 2], clip_region[2]).reshape(-1, 1)
    y_max = np.minimum(bbox[:, 3], clip_region[3]).reshape(-1, 1)

    bbox = np.hstack((x_min, y_min, x_max, y_max, bbox[:, 4:]))

    # Fraction of each box's area lost to clipping.
    delta_area = (ar_ - bbox_area(bbox)) / ar_

    # Keep only boxes that retain at least `alpha` of their original area.
    mask = (delta_area < (1 - alpha)).astype(int)

    bbox = bbox[mask == 1, :]  # gt boxes are dropped here...

    return bbox  # ...but the matching gt_labels/gt_masks are never filtered

This is the clip_box function in your code, which may delete some gt boxes. However, you forgot to delete the corresponding gt labels; a sketch of the fix follows.
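A minimal sketch of that fix: compute the keep mask once and apply it to the labels and masks as well, so the three annotation arrays stay aligned. The function and variable names here are illustrative, not from the original code:

    import numpy as np

    def bbox_area(bbox):
        return (bbox[:, 2] - bbox[:, 0]) * (bbox[:, 3] - bbox[:, 1])

    def clip_annotations(bboxes, labels, masks, clip_region, alpha):
        # Clip boxes to the crop region, exactly as clip_box does above.
        areas = bbox_area(bboxes)
        clipped = np.hstack((
            np.maximum(bboxes[:, 0], clip_region[0]).reshape(-1, 1),
            np.maximum(bboxes[:, 1], clip_region[1]).reshape(-1, 1),
            np.minimum(bboxes[:, 2], clip_region[2]).reshape(-1, 1),
            np.minimum(bboxes[:, 3], clip_region[3]).reshape(-1, 1),
            bboxes[:, 4:],
        ))
        # Boolean keep mask: a box survives if it retained at least
        # `alpha` of its original area after clipping.
        keep = (areas - bbox_area(clipped)) / areas < (1 - alpha)
        # Apply the SAME mask to labels (ndarray) and masks (list),
        # which is the step the original code missed.
        return (clipped[keep],
                labels[keep],
                [m for m, k in zip(masks, keep) if k])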

Read more comments on GitHub >

Top Results From Across the Web

CUDA Error: Device-Side Assert Triggered: Solved | Built In
A CUDA Error: Device-Side Assert Triggered can either be caused by an inconsistency between the number of labels and output units or an …
Read more >
RuntimeError: CUDA error: device-side assert triggered when ...
Hi! I'm trying to follow the transformers tutorial of FastAIv2 but with data from the Arxiv dataset in Kaggle. I get a RuntimeError:...
Read more >
CUDA error: device-side assert triggered, solutions _JackHu ...
This error is easy to trigger when using fp16. Solutions: (1) check your own code for out-of-bounds array indexing; (2) check whether the inputs to BCELoss have been mapped into the [0, 1] range.
Read more >
runtimeerror: cuda error: cublas_status_execution_failed - You.com ...
Everything works fine until today where while training outputs error as ... on terminal outputs RuntimeError: CUDA error: device-side assert triggered.
Read more >
CUDA error: device-side assert triggered on Colab
While I tried your code, and it did not give me an error, I can say that usually the best practice to debug...
Read more >
