RuntimeError: copy_if failed to synchronize: device-side assert triggered
🐛 Bug
… /pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [0,0,0], thread: [41,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed. …
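The assertion text above can be read as a simple range check. Here is a pure-Python sketch of that check (the helper name `index_in_bounds` is mine, not PyTorch's; the real check lives in IndexKernel.cu): PyTorch, like Python sequences, accepts negative indices down to `-size`, so any index outside `[-size, size)` trips the device-side assert.

```python
def index_in_bounds(index: int, size: int) -> bool:
    """Mirror of the CUDA assertion `index >= -sizes[i] && index < sizes[i]`."""
    return -size <= index < size

# A dimension of size 5 accepts indices -5..4; 5 is out of bounds.
assert index_in_bounds(4, 5)
assert index_in_bounds(-5, 5)
assert not index_in_bounds(5, 5)   # this is the case the kernel rejects
```

Because CUDA kernels run asynchronously, the Python traceback often points at a later, unrelated operation (here `copy_if`), not the indexing that actually failed.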
Traceback (most recent call last):
  File "tools/train_net.py", line 174, in <module>
    main()
  File "tools/train_net.py", line 167, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 73, in train
    arguments,
  File "/run/mount/sdd1/maskrcnn_wrapper/env/lib/python3.5/site-packages/maskrcnn_benchmark-0.1-py3.5-linux-x86_64.egg/maskrcnn_benchmark/engine/trainer.py", line 66, in do_train
    loss_dict = model(images, targets)
  File "/run/mount/sdd1/maskrcnn_wrapper/env/lib/python3.5/site-packages/torch/nn/modules/module.py", line 494, in __call__
    result = self.forward(*input, **kwargs)
  File "/run/mount/sdd1/maskrcnn_wrapper/env/lib/python3.5/site-packages/maskrcnn_benchmark-0.1-py3.5-linux-x86_64.egg/maskrcnn_benchmark/modeling/detector/generalized_rcnn.py", line 50, in forward
    proposals, proposal_losses = self.rpn(images, features, targets)
  File "/run/mount/sdd1/maskrcnn_wrapper/env/lib/python3.5/site-packages/torch/nn/modules/module.py", line 494, in __call__
    result = self.forward(*input, **kwargs)
  File "/run/mount/sdd1/maskrcnn_wrapper/env/lib/python3.5/site-packages/maskrcnn_benchmark-0.1-py3.5-linux-x86_64.egg/maskrcnn_benchmark/modeling/rpn/rpn.py", line 159, in forward
    return self._forward_train(anchors, objectness, rpn_box_regression, targets)
  File "/run/mount/sdd1/maskrcnn_wrapper/env/lib/python3.5/site-packages/maskrcnn_benchmark-0.1-py3.5-linux-x86_64.egg/maskrcnn_benchmark/modeling/rpn/rpn.py", line 175, in _forward_train
    anchors, objectness, rpn_box_regression, targets
  File "/run/mount/sdd1/maskrcnn_wrapper/env/lib/python3.5/site-packages/torch/nn/modules/module.py", line 494, in __call__
    result = self.forward(*input, **kwargs)
  File "/run/mount/sdd1/maskrcnn_wrapper/env/lib/python3.5/site-packages/maskrcnn_benchmark-0.1-py3.5-linux-x86_64.egg/maskrcnn_benchmark/modeling/rpn/inference.py", line 138, in forward
    sampled_boxes.append(self.forward_for_single_feature_map(a, o, b))
  File "/run/mount/sdd1/maskrcnn_wrapper/env/lib/python3.5/site-packages/maskrcnn_benchmark-0.1-py3.5-linux-x86_64.egg/maskrcnn_benchmark/modeling/rpn/inference.py", line 113, in forward_for_single_feature_map
    boxlist = remove_small_boxes(boxlist, self.min_size)
  File "/run/mount/sdd1/maskrcnn_wrapper/env/lib/python3.5/site-packages/maskrcnn_benchmark-0.1-py3.5-linux-x86_64.egg/maskrcnn_benchmark/structures/boxlist_ops.py", line 47, in remove_small_boxes
    (ws >= min_size) & (hs >= min_size)
RuntimeError: copy_if failed to synchronize: device-side assert triggered
This may be similar to https://github.com/facebookresearch/maskrcnn-benchmark/issues/229, but the message is slightly different: #229 reports "an illegal memory access was encountered", while what I hit is "device-side assert triggered".
I have already changed NUM_CLASSES as well.
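Even with NUM_CLASSES changed, a common cause of this assert is an annotation whose category id falls outside the configured range. A minimal sanity check might look like this (the helper `check_category_ids` is a hypothetical name of mine, not part of maskrcnn-benchmark; it assumes the usual convention that label 0 is background, so with `NUM_CLASSES = N` the valid foreground ids are 1..N-1):

```python
def check_category_ids(annotations, num_classes):
    """Return the category ids that fall outside [1, num_classes - 1].

    Assumes label 0 is reserved for background, so NUM_CLASSES counts
    background plus foreground classes.
    """
    valid = range(1, num_classes)
    return sorted({a["category_id"] for a in annotations
                   if a["category_id"] not in valid})

# Example: NUM_CLASSES = 3 (background + 2 classes); id 3 is out of range.
anns = [{"category_id": 1}, {"category_id": 2}, {"category_id": 3}]
print(check_category_ids(anns, num_classes=3))  # [3]
```

Running a check like this over the dataset before training rules out bad labels as the source of the out-of-bounds index.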
To Reproduce
Steps to reproduce the behavior:
Run the training code.
Expected behavior
Training runs without errors.
Environment
PyTorch version: 1.0.0.dev20190409
Is debug build: No
CUDA used to build PyTorch: 10.0.130
OS: Ubuntu 16.04.4 LTS
GCC version: (Ubuntu 5.5.0-12ubuntu1~16.04) 5.5.0 20171010
CMake version: version 3.5.1
Python version: 3.5
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration:
GPU 0: GeForce RTX 2080 Ti
GPU 1: TITAN X (Pascal)
Nvidia driver version: 418.39
cuDNN version: Could not collect
Versions of relevant libraries:
[pip] Could not collect
[conda] Could not collect
Pillow (6.0.0)
Top GitHub Comments
Having a learning rate that is too large was indeed the problem. Lowering the learning rate fixed it.
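A too-large learning rate typically shows up as diverging (NaN/inf) losses a few iterations before the assert fires: once the RPN regresses garbage boxes, downstream indexing can go out of bounds far from the real cause. A minimal sketch of a guard for this (my own code, not part of maskrcnn-benchmark; the floats below stand in for `loss.item()` values):

```python
import math

def losses_are_finite(loss_dict):
    """True if every scalar loss in the dict is a finite number."""
    return all(math.isfinite(v) for v in loss_dict.values())

# Example usage inside a training step:
loss_dict = {"loss_objectness": 0.7, "loss_rpn_box_reg": float("nan")}
if not losses_are_finite(loss_dict):
    # stop early, or lower the learning rate and restart
    print("non-finite loss detected:", loss_dict)
```

Checking this every iteration makes divergence visible at the step it happens, instead of surfacing later as an opaque device-side assert.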
Hello, I have met the same issue. I reduced the learning rate, but that did not resolve it. Could you help me resolve this issue? Thanks!
The error is below:
  File "C:\Anaconda3\envs\maskrcnn_benchmark\lib\site-packages\maskrcnn-benchmark\maskrcnn_benchmark\engine\trainer.py", line 88, in do_train
    loss_dict = model(images, targets)
  File "C:\Anaconda3\envs\maskrcnn_benchmark\lib\site-packages\torch\nn\modules\module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Anaconda3\envs\maskrcnn_benchmark\lib\site-packages\apex-0.1-py3.7-win-amd64.egg\apex\amp\_initialize.py", line 194, in new_fwd
    **applier(kwargs, input_caster))
  File "C:\Anaconda3\envs\maskrcnn_benchmark\lib\site-packages\maskrcnn-benchmark\maskrcnn_benchmark\modeling\detector\generalized_rcnn.py", line 60, in forward
    x, result, detector_losses = self.roi_heads(features, proposals, targets)
  File "C:\Anaconda3\envs\maskrcnn_benchmark\lib\site-packages\torch\nn\modules\module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Anaconda3\envs\maskrcnn_benchmark\lib\site-packages\maskrcnn-benchmark\maskrcnn_benchmark\modeling\roi_heads\roi_heads.py", line 26, in forward
    x, detections, loss_box = self.box(features, proposals, targets)
  File "C:\Anaconda3\envs\maskrcnn_benchmark\lib\site-packages\torch\nn\modules\module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Anaconda3\envs\maskrcnn_benchmark\lib\site-packages\maskrcnn-benchmark\maskrcnn_benchmark\modeling\roi_heads\box_head\box_head.py", line 56, in forward
    [class_logits], [box_regression]
  File "C:\Anaconda3\envs\maskrcnn_benchmark\lib\site-packages\maskrcnn-benchmark\maskrcnn_benchmark\modeling\roi_heads\box_head\loss.py", line 151, in __call__
    sampled_pos_inds_subset = torch.nonzero(labels > 0).squeeze(1)
RuntimeError: copy_if failed to synchronize: device-side assert triggered