copy_if failed to synchronize: device-side assert triggered
❓ Questions and Help
I was training a customized module in fbnet and encountered the following error:
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [40,0,0], thread: [104,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [40,0,0], thread: [105,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [40,0,0], thread: [106,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [40,0,0], thread: [107,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
Traceback (most recent call last):
File "tools/train_net.py", line 186, in <module>
main()
File "tools/train_net.py", line 179, in main
model = train(cfg, args.local_rank, args.distributed)
File "tools/train_net.py", line 85, in train
arguments,
File "/data/ryancheng/maskrcnn-benchmark/maskrcnn_benchmark/engine/trainer.py", line 67, in do_train
loss_dict = model(images, targets)
File "/data/ryancheng/miniconda3/envs/maskrcnn_benchmark/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/data/ryancheng/miniconda3/envs/maskrcnn_benchmark/lib/python3.7/site-packages/apex-0.1-py3.7-linux-x86_64.egg/apex/amp/_initialize.py", line 194, in new_fwd
**applier(kwargs, input_caster))
File "/data/ryancheng/maskrcnn-benchmark/maskrcnn_benchmark/modeling/detector/generalized_rcnn.py", line 50, in forward
proposals, proposal_losses = self.rpn(images, features, targets)
File "/data/ryancheng/miniconda3/envs/maskrcnn_benchmark/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/data/ryancheng/maskrcnn-benchmark/maskrcnn_benchmark/modeling/rpn/rpn.py", line 159, in forward
return self._forward_train(anchors, objectness, rpn_box_regression, targets)
File "/data/ryancheng/maskrcnn-benchmark/maskrcnn_benchmark/modeling/rpn/rpn.py", line 175, in _forward_train
anchors, objectness, rpn_box_regression, targets
File "/data/ryancheng/miniconda3/envs/maskrcnn_benchmark/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/data/ryancheng/maskrcnn-benchmark/maskrcnn_benchmark/modeling/rpn/inference.py", line 140, in forward
sampled_boxes.append(self.forward_for_single_feature_map(a, o, b))
File "/data/ryancheng/maskrcnn-benchmark/maskrcnn_benchmark/modeling/rpn/inference.py", line 115, in forward_for_single_feature_map
boxlist = remove_small_boxes(boxlist, self.min_size)
File "/data/ryancheng/maskrcnn-benchmark/maskrcnn_benchmark/structures/boxlist_ops.py", line 46, in remove_small_boxes
(ws >= min_size) & (hs >= min_size)
RuntimeError: copy_if failed to synchronize: device-side assert triggered
Any idea what causes this error? Thanks in advance!
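A note on reading the log above: the assertion comes from IndexKernel.cu, while the Python traceback ends in remove_small_boxes. CUDA kernels are launched asynchronously, so the Python line that reports the error is not necessarily the one that launched the failing kernel. Setting the environment variable CUDA_LAUNCH_BLOCKING=1 before CUDA is initialised makes launches synchronous, which usually gives a more accurate traceback. The sketch below sets that variable and mirrors the logged condition (`index >= -sizes[i] && index < sizes[i]`) in plain Python. The helper name `out_of_bounds` is hypothetical and is not part of PyTorch or maskrcnn-benchmark.

```python
import os

# Must be set before the first CUDA call in the process, so that kernels run
# synchronously and the traceback points closer to the failing operation.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def out_of_bounds(indices, size):
    """Return the indices that would fail the logged assertion.

    Hypothetical helper: mirrors `index >= -size && index < size`
    for a single dimension of length `size`.
    """
    return [i for i in indices if not (-size <= i < size)]


# Valid indices for a dimension of length 4 are -4..3.
print(out_of_bounds([0, 3, -4, 4, -5], size=4))  # [4, -5]
```

With real tensors, the same check could be run on a CPU copy of the index tensor (for example via `.tolist()`) just before the indexing operation.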
Issue Analytics
- State:
- Created 4 years ago
- Comments:7
Top GitHub Comments
I encountered this issue when I set the wrong number of classes. It was solved by correcting the output num_class.
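The class-count mismatch described in this comment can be caught on the CPU before any data reaches the GPU. The following is a minimal sketch, assuming labels are integer class ids that must lie in [0, num_classes). The function name `check_labels` is hypothetical, and whether a background class is counted in num_classes depends on the model configuration.

```python
def check_labels(labels, num_classes):
    """Raise ValueError if any label falls outside [0, num_classes).

    Hypothetical helper; `labels` is any iterable of ints, for example
    the result of calling .tolist() on a label tensor.
    """
    bad = sorted({label for label in labels if not (0 <= label < num_classes)})
    if bad:
        raise ValueError(
            f"labels {bad} are outside the valid range [0, {num_classes})"
        )


check_labels([0, 1, 2], num_classes=3)  # passes silently

try:
    check_labels([0, 3], num_classes=3)
except ValueError as err:
    print(err)  # labels [3] are outside the valid range [0, 3)
```

Running such a check once over the dataset gives a readable Python error in place of the device-side assert.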
Same error.
If the problem happens during training, a smaller learning rate helps, but I've also encountered this error once while testing. At test time it can't be solved by lowering the learning rate, right?
I tried to debug, but I can't even access the "boxlist": a runtime error is raised as soon as I try to print it.
I'd really like to know whether anybody has a solution other than just "using a smaller learning rate".
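On the printing failure: once a device-side assert has fired, later CUDA calls in the same process fail as well, so the boxes have to be inspected before the failing operation, not after it. One possible reading of the learning-rate observation, which the thread does not confirm, is that training diverged and produced NaN or infinite box coordinates. The sketch below, in plain Python, shows the kind of check that would detect this. The name `non_finite_rows` is hypothetical; on tensors, `torch.isfinite` performs the equivalent element-wise test.

```python
import math


def non_finite_rows(boxes):
    """Return the row indices of boxes containing NaN or +/-inf.

    Hypothetical helper; `boxes` is a list of [x1, y1, x2, y2] rows,
    for example a CPU copy of the box tensor converted with .tolist().
    """
    return [
        i for i, box in enumerate(boxes)
        if not all(math.isfinite(v) for v in box)
    ]


boxes = [
    [0.0, 0.0, 10.0, 10.0],
    [float("nan"), 0.0, 5.0, 5.0],
    [1.0, 2.0, float("inf"), 4.0],
]
print(non_finite_rows(boxes))  # [1, 2]
```

If this returns any rows during training, the issue is more likely numerical than a data or indexing bug. If it returns nothing, the class-count and index-range checks are the more promising direction.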