
RuntimeError: copy_if failed to synchronize: device-side assert triggered

See original GitHub issue

🐛 Bug

… /pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [0,0,0], thread: [41,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. …

Traceback (most recent call last):
  File "tools/train_net.py", line 174, in <module>
    main()
  File "tools/train_net.py", line 167, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 73, in train
    arguments,
  File "/run/mount/sdd1/maskrcnn_wrapper/env/lib/python3.5/site-packages/maskrcnn_benchmark-0.1-py3.5-linux-x86_64.egg/maskrcnn_benchmark/engine/trainer.py", line 66, in do_train
    loss_dict = model(images, targets)
  File "/run/mount/sdd1/maskrcnn_wrapper/env/lib/python3.5/site-packages/torch/nn/modules/module.py", line 494, in __call__
    result = self.forward(*input, **kwargs)
  File "/run/mount/sdd1/maskrcnn_wrapper/env/lib/python3.5/site-packages/maskrcnn_benchmark-0.1-py3.5-linux-x86_64.egg/maskrcnn_benchmark/modeling/detector/generalized_rcnn.py", line 50, in forward
    proposals, proposal_losses = self.rpn(images, features, targets)
  File "/run/mount/sdd1/maskrcnn_wrapper/env/lib/python3.5/site-packages/torch/nn/modules/module.py", line 494, in __call__
    result = self.forward(*input, **kwargs)
  File "/run/mount/sdd1/maskrcnn_wrapper/env/lib/python3.5/site-packages/maskrcnn_benchmark-0.1-py3.5-linux-x86_64.egg/maskrcnn_benchmark/modeling/rpn/rpn.py", line 159, in forward
    return self._forward_train(anchors, objectness, rpn_box_regression, targets)
  File "/run/mount/sdd1/maskrcnn_wrapper/env/lib/python3.5/site-packages/maskrcnn_benchmark-0.1-py3.5-linux-x86_64.egg/maskrcnn_benchmark/modeling/rpn/rpn.py", line 175, in _forward_train
    anchors, objectness, rpn_box_regression, targets
  File "/run/mount/sdd1/maskrcnn_wrapper/env/lib/python3.5/site-packages/torch/nn/modules/module.py", line 494, in __call__
    result = self.forward(*input, **kwargs)
  File "/run/mount/sdd1/maskrcnn_wrapper/env/lib/python3.5/site-packages/maskrcnn_benchmark-0.1-py3.5-linux-x86_64.egg/maskrcnn_benchmark/modeling/rpn/inference.py", line 138, in forward
    sampled_boxes.append(self.forward_for_single_feature_map(a, o, b))
  File "/run/mount/sdd1/maskrcnn_wrapper/env/lib/python3.5/site-packages/maskrcnn_benchmark-0.1-py3.5-linux-x86_64.egg/maskrcnn_benchmark/modeling/rpn/inference.py", line 113, in forward_for_single_feature_map
    boxlist = remove_small_boxes(boxlist, self.min_size)
  File "/run/mount/sdd1/maskrcnn_wrapper/env/lib/python3.5/site-packages/maskrcnn_benchmark-0.1-py3.5-linux-x86_64.egg/maskrcnn_benchmark/structures/boxlist_ops.py", line 47, in remove_small_boxes
    (ws >= min_size) & (hs >= min_size)
RuntimeError: copy_if failed to synchronize: device-side assert triggered

This may be similar to https://github.com/facebookresearch/maskrcnn-benchmark/issues/229, but the message is slightly different: #229 hits "an illegal memory access was encountered", whereas what I get is "device-side assert triggered".
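Because CUDA kernels run asynchronously, the copy_if frame in the traceback is usually just where the failure happens to surface; the real cause is the "index out of bounds" assertion at the top of the log. A minimal sketch for surfacing the actual failing operation, assuming a CUDA build (the tensors below are hypothetical stand-ins for whatever is being indexed):

import os

# Make kernel launches synchronous so the failing op raises at its call site.
# Must be set before any CUDA work happens in the process.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

# Toy reproduction of the assert: indexing past the end of a tensor.
x = torch.arange(4)
bad_idx = torch.tensor([5])  # out of range for a length-4 tensor

try:
    x[bad_idx]  # on CPU this raises a clear "index 5 is out of bounds" error
except (IndexError, RuntimeError) as e:
    print("CPU error:", e)

if torch.cuda.is_available():
    # On the GPU the same indexing trips the device-side assert from the log;
    # with CUDA_LAUNCH_BLOCKING=1 the stack trace points at this line instead
    # of an unrelated later call such as copy_if.
    x.cuda()[bad_idx.cuda()]

Running the training script on CPU for a few iterations has the same effect: the out-of-range index raises an ordinary Python exception instead of a deferred device assert.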

I have changed the NUM_CLASSES as well.
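For reference, this is roughly how the class count is overridden in maskrcnn-benchmark; a minimal sketch using the yacs-based config, with a hypothetical config file and a hypothetical dataset of 10 foreground classes (NUM_CLASSES includes background, so 10 + 1):

from maskrcnn_benchmark.config import cfg

# Hypothetical paths/values - substitute your own config file and class count.
cfg.merge_from_file("configs/e2e_faster_rcnn_R_50_FPN_1x.yaml")
cfg.merge_from_list([
    # NUM_CLASSES = number of foreground classes + 1 (background);
    # annotation labels are then expected to lie in 1 .. NUM_CLASSES - 1.
    "MODEL.ROI_BOX_HEAD.NUM_CLASSES", 11,
])
cfg.freeze()

If any annotation label is larger than NUM_CLASSES - 1 (or negative), an indexing assert like the one above is exactly the kind of error it produces.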

To Reproduce

Steps to reproduce the behavior:

Run training code

Expected behavior

No error

Environment

PyTorch version: 1.0.0.dev20190409
Is debug build: No
CUDA used to build PyTorch: 10.0.130

OS: Ubuntu 16.04.4 LTS
GCC version: (Ubuntu 5.5.0-12ubuntu1~16.04) 5.5.0 20171010
CMake version: version 3.5.1

Python version: 3.5
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration:
GPU 0: GeForce RTX 2080 Ti
GPU 1: TITAN X (Pascal)

Nvidia driver version: 418.39
cuDNN version: Could not collect

Versions of relevant libraries:
[pip] Could not collect
[conda] Could not collect
Pillow (6.0.0)

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 5

Top GitHub Comments

11 reactions
yxchng commented, Apr 11, 2019

A learning rate that is too large is indeed the problem; lowering the learning rate solves it.
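For anyone looking for where to change it: a minimal sketch of lowering the learning rate through the solver config, assuming the standard maskrcnn-benchmark options (the concrete numbers are only placeholders):

from maskrcnn_benchmark.config import cfg

# Placeholder values: a smaller base LR plus a longer warmup helps keep the
# loss from blowing up (and producing garbage box indices) early in training.
cfg.merge_from_list([
    "SOLVER.BASE_LR", 0.0025,
    "SOLVER.WARMUP_ITERS", 1000,
])

The same key/value overrides can usually be appended at the end of the tools/train_net.py command line after the --config-file argument.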

0 reactions
Enn29 commented, Sep 15, 2020

Hello, I have run into the same issue. I reduced the learning rate, but that did not resolve it. Could you help me resolve the issue? Thanks!

The error is below:

  File "C:\Anaconda3\envs\maskrcnn_benchmark\lib\site-packages\maskrcnn-benchmark\maskrcnn_benchmark\engine\trainer.py", line 88, in do_train
    loss_dict = model(images, targets)
  File "C:\Anaconda3\envs\maskrcnn_benchmark\lib\site-packages\torch\nn\modules\module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Anaconda3\envs\maskrcnn_benchmark\lib\site-packages\apex-0.1-py3.7-win-amd64.egg\apex\amp\_initialize.py", line 194, in new_fwd
    **applier(kwargs, input_caster))
  File "C:\Anaconda3\envs\maskrcnn_benchmark\lib\site-packages\maskrcnn-benchmark\maskrcnn_benchmark\modeling\detector\generalized_rcnn.py", line 60, in forward
    x, result, detector_losses = self.roi_heads(features, proposals, targets)
  File "C:\Anaconda3\envs\maskrcnn_benchmark\lib\site-packages\torch\nn\modules\module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Anaconda3\envs\maskrcnn_benchmark\lib\site-packages\maskrcnn-benchmark\maskrcnn_benchmark\modeling\roi_heads\roi_heads.py", line 26, in forward
    x, detections, loss_box = self.box(features, proposals, targets)
  File "C:\Anaconda3\envs\maskrcnn_benchmark\lib\site-packages\torch\nn\modules\module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Anaconda3\envs\maskrcnn_benchmark\lib\site-packages\maskrcnn-benchmark\maskrcnn_benchmark\modeling\roi_heads\box_head\box_head.py", line 56, in forward
    [class_logits], [box_regression]
  File "C:\Anaconda3\envs\maskrcnn_benchmark\lib\site-packages\maskrcnn-benchmark\maskrcnn_benchmark\modeling\roi_heads\box_head\loss.py", line 151, in __call__
    sampled_pos_inds_subset = torch.nonzero(labels > 0).squeeze(1)
RuntimeError: copy_if failed to synchronize: device-side assert triggered
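This second traceback fails in the box-head loss at torch.nonzero(labels > 0), which points at the targets rather than the learning rate: the device-side assert is often raised when a ground-truth label falls outside 0 .. NUM_CLASSES - 1. A minimal sketch for checking that, assuming a maskrcnn-benchmark-style dataset whose __getitem__ returns (image, BoxList target, index); the function name and class count here are hypothetical:

def check_label_range(dataset, num_classes):
    # Scan every target and report out-of-range category labels.
    # Background is 0; foreground classes should be 1 .. num_classes - 1,
    # matching MODEL.ROI_BOX_HEAD.NUM_CLASSES.
    bad = []
    for idx in range(len(dataset)):
        _, target, _ = dataset[idx]
        labels = target.get_field("labels")
        if labels.numel() == 0:
            continue
        if labels.min().item() < 0 or labels.max().item() >= num_classes:
            bad.append((idx, labels.min().item(), labels.max().item()))
    return bad

# Usage with placeholder names:
# bad = check_label_range(train_dataset, num_classes=cfg.MODEL.ROI_BOX_HEAD.NUM_CLASSES)
# print(bad[:10])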


Top Results From Across the Web

PyTorch: copy_if failed to synchronize: device-side assert ...
Sometimes when we run code using CUDA, it gives an error message containing "device-side assert triggered", which hides the real error message.
RuntimeError: copy_if failed to synchronize ... - PyTorch Forums
I'm getting the following errors with my code. It is an adapted version of the PyTorch DQN example.
The "copy_if failed to synchronize: device-side assert triggered" problem ...
Error message: RuntimeError: copy_if failed to synchronize: device-side assert triggered. Context: in my case the problem mainly appeared at the loss computation, ...
Tpetra: Run-time error in idot unit test, in CUDA build only
I suspect the issue is that KokkosBlas::dot uses its X vector input to determine the execution space, but a raw pointer result argument...
CUDA C++ Programming Guide - NVIDIA Documentation Center
The CUDA memory consistency model guarantees that the asserted condition will be ... In order to fully understand the device-side synchronization model, ...
