[fp16 training error] CUDA error: device-side assert triggered
Checklist
- [x] I have searched related issues but could not get the expected help.
- [x] The bug has not been fixed in the latest version.
Describe the bug
Error traceback
- What command or script did you run?
I ran the following command to train `mask_rcnn_r50_fpn_fp16`:
==============================================
NUM_GPUS=4
CONFIG=mmdetection/configs/fp16/mask_rcnn_r50_fpn_fp16_1x.py
WORK_DIR=work_dirs/mask_rcnn_r50_fpn_fp16_1x
tools/dist_train.sh $CONFIG $NUM_GPUS --validate --work_dir $WORK_DIR
==============================================
- If applicable, paste the error traceback here using code blocks.
Because it is too long, I will paste it at the end.
Reproduction details
- Did you make any modifications on the code? Did you understand what you have modified? No
- What dataset did you use? COCO
Environment
- OS: Ubuntu 16.04.4
- GCC: 5.4.0
- PyTorch version: 1.1.0
- How you installed PyTorch: conda (inside docker)
- GPU model: V100 32GB (NVLink)
- CUDA and cuDNN version: CUDA 9.0, cuDNN 7
When I try to train the fp16 model, I get `CUDA error: device-side assert triggered (insert_events at …/c10/cuda/CUDACachingAllocator.cpp:564)`
along with many repetitions of the following message:
/tmp/pip-req-build-fl_vaj2n/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [1,0,0], thread: [80,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
When I comment out the fp16 configuration, the error does not occur: https://github.com/open-mmlab/mmdetection/blob/master/configs/fp16/mask_rcnn_r50_fpn_fp16_1x.py#L2
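For reference, the line I comment out is, as far as I can tell, just the switch that enables mixed-precision training in that config (please verify against your own checkout):

```python
# configs/fp16/mask_rcnn_r50_fpn_fp16_1x.py, line 2 (as I understand it):
# a single dict that turns on fp16 training with a static loss scale.
fp16 = dict(loss_scale=512.)
```

Also note that device-side asserts are reported asynchronously, so rerunning with `CUDA_LAUNCH_BLOCKING=1` usually makes the Python traceback point at the operation that actually hit the bad index rather than at a later allocator call.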
Error message
/home/user/Desktop/workspace_zacurr/mmdetection/work_dirs/mask_rcnn_r50_fpn_fp16_1x Directory exists
loading annotations into memory…
2019-07-02 00:56:29,056 - INFO - Distributed training: True
2019-07-02 00:56:29,549 - INFO - load model from: modelzoo://resnet50
loading annotations into memory…
2019-07-02 00:56:29,828 - WARNING - unexpected key in source state_dict: fc.weight, fc.bias
missing keys in source state_dict: layer3.0.bn1.num_batches_tracked, layer2.0.bn3.num_batches_tracked, layer2.2.bn1.num_batches_tracked, layer2.1.bn2.num_batches_tracked, layer1.0.bn2.num_batches_tracked, layer3.0.downsample.1.num_batches_tracked, layer4.1.bn2.num_batches_tracked, layer3.5.bn2.num_batches_tracked, layer3.1.bn2.num_batches_tracked, layer3.4.bn1.num_batches_tracked, layer1.2.bn2.num_batches_tracked, layer3.2.bn2.num_batches_tracked, layer3.1.bn3.num_batches_tracked, layer4.2.bn2.num_batches_tracked, layer2.0.bn2.num_batches_tracked, layer2.3.bn1.num_batches_tracked, layer4.2.bn3.num_batches_tracked, layer3.4.bn3.num_batches_tracked, layer3.2.bn3.num_batches_tracked, layer1.0.downsample.1.num_batches_tracked, layer2.1.bn1.num_batches_tracked, layer3.3.bn3.num_batches_tracked, layer4.0.downsample.1.num_batches_tracked, layer4.0.bn1.num_batches_tracked, layer4.0.bn3.num_batches_tracked, layer1.1.bn2.num_batches_tracked, layer3.0.bn3.num_batches_tracked, layer3.2.bn1.num_batches_tracked, layer3.0.bn2.num_batches_tracked, layer4.0.bn2.num_batches_tracked, layer2.2.bn2.num_batches_tracked, layer3.5.bn3.num_batches_tracked, layer1.0.bn1.num_batches_tracked, layer2.3.bn3.num_batches_tracked, layer1.0.bn3.num_batches_tracked, layer3.3.bn2.num_batches_tracked, layer4.1.bn1.num_batches_tracked, layer1.1.bn3.num_batches_tracked, layer2.3.bn2.num_batches_tracked, layer3.3.bn1.num_batches_tracked, layer3.1.bn1.num_batches_tracked, layer3.5.bn1.num_batches_tracked, layer2.0.downsample.1.num_batches_tracked, layer1.1.bn1.num_batches_tracked, layer3.4.bn2.num_batches_tracked, bn1.num_batches_tracked, layer1.2.bn1.num_batches_tracked, layer4.2.bn1.num_batches_tracked, layer2.0.bn1.num_batches_tracked, layer4.1.bn3.num_batches_tracked, layer2.1.bn3.num_batches_tracked, layer1.2.bn3.num_batches_tracked, layer2.2.bn3.num_batches_tracked
loading annotations into memory…
loading annotations into memory…
Done (t=12.76s)
creating index…
Done (t=12.47s)
creating index…
index created!
Done (t=12.82s)
creating index…
index created!
index created!
Done (t=13.82s)
creating index…
index created!
loading annotations into memory…
loading annotations into memory…
loading annotations into memory…
loading annotations into memory…
Done (t=1.77s)
creating index…
index created!
Done (t=2.36s)
creating index…
Done (t=2.39s)
creating index…
index created!
index created!
Done (t=2.53s)
creating index…
index created!
2019-07-02 00:56:53,981 - INFO - Start running, host: root@b6940c72ef4f, work_dir: /home/user/Desktop/workspace_zacurr/mmdetection/work_dirs/mask_rcnn_r50_fpn_fp16_1x
2019-07-02 00:56:53,981 - INFO - workflow: [(‘train’, 1)], max: 12 epochs
/tmp/pip-req-build-fl_vaj2n/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [5,0,0], thread: [96,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds"
failed.
/tmp/pip-req-build-fl_vaj2n/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [5,0,0], thread: [97,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds"
failed.
/tmp/pip-req-build-fl_vaj2n/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [5,0,0], thread: [98,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds"
failed.
… omitted…
/tmp/pip-req-build-fl_vaj2n/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [1,0,0], thread: [126,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds"
failed.
/tmp/pip-req-build-fl_vaj2n/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [1,0,0], thread: [127,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds"
failed.
Traceback (most recent call last):
File “/home/user/Desktop/workspace_zacurr/mmdetection/tools/train.py”, line 98, in <module>
main()
File “/home/user/Desktop/workspace_zacurr/mmdetection/tools/train.py”, line 94, in main
logger=logger)
File “/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/apis/train.py”, line 60, in train_detector
_dist_train(model, dataset, cfg, validate=validate)
File “/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/apis/train.py”, line 189, in _dist_train
runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
File “/opt/conda/lib/python3.6/site-packages/mmcv/runner/runner.py”, line 356, in run
epoch_runner(data_loaders[i], **kwargs)
File “/opt/conda/lib/python3.6/site-packages/mmcv/runner/runner.py”, line 262, in train
self.model, data_batch, train_mode=True, **kwargs)
File “/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/apis/train.py”, line 40, in batch_processor
losses = model(**data)
File “/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py”, line 494, in call
result = self.forward(*input, **kwargs)
File “/opt/conda/lib/python3.6/site-packages/mmcv/parallel/distributed.py”, line 50, in forward
return self.module(*inputs[0], **kwargs[0])
File “/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py”, line 494, in call
result = self.forward(*input, **kwargs)
File “/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/core/fp16/decorators.py”, line 75, in new_func
output = old_func(*new_args, **new_kwargs)
File “/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/models/detectors/base.py”, line 86, in forward
return self.forward_train(img, img_meta, **kwargs)
File “/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/models/detectors/two_stage.py”, line 114, in forward_train
proposal_list = self.rpn_head.get_bboxes(*proposal_inputs)
File “/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/core/fp16/decorators.py”, line 152, in new_func
output = old_func(*new_args, **new_kwargs)
File “/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/models/anchor_heads/anchor_head.py”, line 221, in get_bboxes
scale_factor, cfg, rescale)
File “/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/models/anchor_heads/rpn_head.py”, line 83, in get_bboxes_single
self.target_stds, img_shape)
File “/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/core/bbox/transforms.py”, line 40, in delta2bbox
means = deltas.new_tensor(means).repeat(1, deltas.size(1) // 4)
RuntimeError: CUDA error: device-side assert triggered
(The identical traceback is printed by each of the other three GPU processes; the repeated copies are omitted here.)
terminate called after throwing an instance of ‘c10::Error’
what(): CUDA error: device-side assert triggered (insert_events at …/c10/cuda/CUDACachingAllocator.cpp:564)
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x6a (0x7fe9d572d66a in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x140e0 (0x7fe9cf61b0e0 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x61 (0x7fe9d571b661 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #3: torch::autograd::Variable::Impl::release_resources() + 0x5e (0x7fe9d4d160ae in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #4: <unknown function> + 0x1333fb (0x7fe9ed5f13fb in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x352ae4 (0x7fe9ed810ae4 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x352b41 (0x7fe9ed810b41 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x19dbbc (0x5575e53ecbbc in /opt/conda/bin/python)
frame #8: <unknown function> + 0xf32a8 (0x5575e53422a8 in /opt/conda/bin/python)
frame #9: <unknown function> + 0xf343a (0x5575e534243a in /opt/conda/bin/python)
frame #10: <unknown function> + 0xf2c77 (0x5575e5341c77 in /opt/conda/bin/python)
frame #11: <unknown function> + 0xf2b07 (0x5575e5341b07 in /opt/conda/bin/python)
frame #12: <unknown function> + 0xf2b1d (0x5575e5341b1d in /opt/conda/bin/python)
frame #13: <unknown function> + 0xf2b1d (0x5575e5341b1d in /opt/conda/bin/python)
frame #14: <unknown function> + 0xf2b1d (0x5575e5341b1d in /opt/conda/bin/python)
frame #15: <unknown function> + 0xf2b1d (0x5575e5341b1d in /opt/conda/bin/python)
frame #16: <unknown function> + 0xf2b1d (0x5575e5341b1d in /opt/conda/bin/python)
frame #17: <unknown function> + 0xf2b1d (0x5575e5341b1d in /opt/conda/bin/python)
frame #18: <unknown function> + 0xf2b1d (0x5575e5341b1d in /opt/conda/bin/python)
frame #19: <unknown function> + 0xf2b1d (0x5575e5341b1d in /opt/conda/bin/python)
frame #20: PyDict_SetItem + 0x3da (0x5575e5387d4a in /opt/conda/bin/python)
frame #21: PyDict_SetItemString + 0x4f (0x5575e539084f in /opt/conda/bin/python)
frame #22: PyImport_Cleanup + 0x99 (0x5575e53f6b79 in /opt/conda/bin/python)
frame #23: Py_FinalizeEx + 0x61 (0x5575e5461961 in /opt/conda/bin/python)
frame #24: Py_Main + 0x355 (0x5575e546beb5 in /opt/conda/bin/python)
frame #25: main + 0xee (0x5575e5333b4e in /opt/conda/bin/python)
frame #26: __libc_start_main + 0xf0 (0x7fea04b54830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #27: <unknown function> + 0x1c61a8 (0x5575e54151a8 in /opt/conda/bin/python)
(The same c10::Error backtrace, differing only in pointer addresses, is printed by the other three processes; omitted here.)
Traceback (most recent call last):
File “/opt/conda/lib/python3.6/runpy.py”, line 193, in _run_module_as_main
main, mod_spec)
File “/opt/conda/lib/python3.6/runpy.py”, line 85, in _run_code
exec(code, run_globals)
File “/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py”, line 235, in <module>
main()
File “/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py”, line 231, in main
cmd=process.args)
subprocess.CalledProcessError: Command ‘[’/opt/conda/bin/python’, ‘-u’, ‘/home/user/Desktop/workspace_zacurr/mmdetection/tools/train.py’, ‘–local_rank=0’, ‘/home/user/Desktop/workspace_zacurr/mmdetection/configs/fp16/mask_rcnn_r50_fpn_fp16_1x.py’, ‘–launcher’, ‘pytorch’, ‘–validate’, ‘–work_dir’, ‘/home/user/Desktop/workspace_zacurr/mmdetection/work_dirs/mask_rcnn_r50_fpn_fp16_1x’]’ died with <Signals.SIGABRT: 6>.
Top GitHub Comments
@gittigxuy sorry for the late response! I have just solved the problem. I found that it is caused by a mismatch among the numbers of gt_bboxes, gt_labels and gt_masks. When applying the crop operation, I filtered out some bboxes that fell outside the cropping range but forgot to filter the corresponding gt_labels and gt_masks. So I guess your problem is caused by the same reason?
This is the `clip_box` function in your code, which may delete some gt boxes. However, you forgot to delete the corresponding gt labels.
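For anyone hitting the same assert: a minimal sketch of the idea, under the assumption that gt_bboxes/gt_labels are numpy arrays and gt_masks is a list of 2D arrays. The function name and signature below are hypothetical, not the actual code in question; the point is only that whatever keep mask is applied to gt_bboxes after cropping/clipping must also be applied to gt_labels and gt_masks.

```python
import numpy as np

def crop_annotations(gt_bboxes, gt_labels, gt_masks, x1, y1, x2, y2):
    """Crop gt annotations to the window (x1, y1, x2, y2), keeping all three aligned."""
    bboxes = gt_bboxes.astype(np.float32)
    # Shift boxes into the crop's coordinate frame and clip them to its borders.
    bboxes[:, 0::2] = np.clip(bboxes[:, 0::2] - x1, 0, x2 - x1)
    bboxes[:, 1::2] = np.clip(bboxes[:, 1::2] - y1, 0, y2 - y1)
    # Keep only boxes that still have positive area after clipping.
    keep = (bboxes[:, 2] > bboxes[:, 0]) & (bboxes[:, 3] > bboxes[:, 1])
    # Apply the SAME keep mask to labels and masks; dropping boxes without
    # dropping their labels/masks is what leads to out-of-range indices later.
    bboxes = bboxes[keep]
    labels = gt_labels[keep]
    masks = [m[y1:y2, x1:x2] for m, k in zip(gt_masks, keep) if k]
    return bboxes, labels, masks
```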