CUDA error from spconv

Hi,

We are training the listed models on KITTI. With both of the PartA2 models, a CUDA error from spconv pops up. Other models that use spconv train fine, so we think the spconv installation itself is good.

  File "train.py", line 198, in <module>
    main()
  File "train.py", line 170, in main
    merge_all_iters_to_one_epoch=args.merge_all_iters_to_one_epoch
  File "/workspace/wanghuijie-data/pcdet/scripts/train_utils/train_utils.py", line 93, in train_model
    dataloader_iter=dataloader_iter
  File "/workspace/wanghuijie-data/pcdet/scripts/train_utils/train_utils.py", line 38, in train_one_epoch
    loss, tb_dict, disp_dict = model_func(model, batch)
  File "/workspace/wanghuijie-data/pcdet/pcdet/models/__init__.py", line 30, in model_func
    ret_dict, tb_dict, disp_dict = model(batch_dict)
  File "/workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/workspace/wanghuijie-data/pcdet/pcdet/models/detectors/point_rcnn.py", line 11, in forward
    batch_dict = cur_module(batch_dict)
  File "/workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/workspace/wanghuijie-data/pcdet/pcdet/models/roi_heads/partA2_head.py", line 200, in forward
    x_part = self.conv_part(part_features)
  File "/workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/workspace/.venv/point-cloud/lib/python3.6/site-packages/spconv/modules.py", line 134, in forward
    input = module(input)
  File "/workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/workspace/.venv/point-cloud/lib/python3.6/site-packages/spconv/modules.py", line 134, in forward
    input = module(input)
  File "/workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/workspace/.venv/point-cloud/lib/python3.6/site-packages/spconv/conv.py", line 181, in forward
    use_hash=self.use_hash)
  File "/workspace/.venv/point-cloud/lib/python3.6/site-packages/spconv/ops.py", line 95, in get_indice_pairs
    int(use_hash))
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:733 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fd430ec42f2 in /workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7fd430ec167b in /workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x809 (0x7fd4392e41f9 in /workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7fd430eac3a4 in /workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #4: std::vector<c10d::Reducer::Bucket, std::allocator<c10d::Reducer::Bucket> >::~vector() + 0x2f9 (0x7fd452338d19 in /workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #5: c10d::Reducer::~Reducer() + 0x26a (0x7fd45232db5a in /workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #6: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7fd452355b92 in /workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #7: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x46 (0x7fd451c70056 in /workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0xa43caf (0x7fd452358caf in /workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x36a4a8 (0x7fd451c7f4a8 in /workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x36b7ae (0x7fd451c807ae in /workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #11: <unknown function> + 0xf2828 (0x5643e76d1828 in /workspace/.venv/point-cloud/bin/python)
frame #12: <unknown function> + 0x19aa90 (0x5643e7779a90 in /workspace/.venv/point-cloud/bin/python)
frame #13: <unknown function> + 0xf2247 (0x5643e76d1247 in /workspace/.venv/point-cloud/bin/python)
frame #14: <unknown function> + 0xf20d7 (0x5643e76d10d7 in /workspace/.venv/point-cloud/bin/python)
frame #15: <unknown function> + 0xf20ed (0x5643e76d10ed in /workspace/.venv/point-cloud/bin/python)
frame #16: PyDict_SetItem + 0x3da (0x5643e7717d7a in /workspace/.venv/point-cloud/bin/python)
frame #17: PyDict_SetItemString + 0x4f (0x5643e771ec5f in /workspace/.venv/point-cloud/bin/python)
frame #18: PyImport_Cleanup + 0x99 (0x5643e7783dc9 in /workspace/.venv/point-cloud/bin/python)
frame #19: Py_FinalizeEx + 0x61 (0x5643e77ee961 in /workspace/.venv/point-cloud/bin/python)
frame #20: Py_Main + 0x35e (0x5643e77f8cae in /workspace/.venv/point-cloud/bin/python)
frame #21: main + 0xee (0x5643e76c2f2e in /workspace/.venv/point-cloud/bin/python)
frame #22: __libc_start_main + 0xe7 (0x7fd457450b97 in /lib/x86_64-linux-gnu/libc.so.6)
frame #23: <unknown function> + 0x1c327f (0x5643e77a227f in /workspace/.venv/point-cloud/bin/python)
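
One cause worth ruling out for an illegal memory access inside `get_indice_pairs` is a voxel index that lies outside the grid declared for the sparse tensor. This thread does not confirm that as the cause here. The sketch below is a plain-Python check that can run on the CPU before the sparse tensor is built. It assumes each index row is laid out as (batch_idx, z, y, x); the function name and arguments are hypothetical and are not part of spconv or pcdet.

```python
from typing import Iterable, List, Sequence, Tuple


def find_out_of_range_voxels(
    indices: Iterable[Sequence[int]],
    spatial_shape: Sequence[int],
    batch_size: int,
) -> List[Tuple[int, Tuple[int, ...]]]:
    """Return (row_number, row) for every index row that falls outside the grid.

    Assumption: each row is (batch_idx, *spatial_coords) and spatial_shape
    lists the grid size along each spatial axis in the same order.
    """
    bad = []
    for row_number, row in enumerate(indices):
        row = tuple(int(v) for v in row)
        batch_idx, coords = row[0], row[1:]
        # The batch index must address an existing sample.
        in_batch = 0 <= batch_idx < batch_size
        # Every spatial coordinate must lie inside [0, size) on its axis.
        in_grid = len(coords) == len(spatial_shape) and all(
            0 <= c < size for c, size in zip(coords, spatial_shape)
        )
        if not (in_batch and in_grid):
            bad.append((row_number, row))
    return bad
```

With PyTorch tensors, the rows could be passed as something like `indices.cpu().tolist()` just before the sparse convolution is called (`indices` being a hypothetical name for the index tensor). An empty result would suggest looking elsewhere, for example at a mismatch between the installed spconv build and the CUDA toolkit.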

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 6 (1 by maintainers)

Top GitHub Comments

KangChou commented, Jun 24, 2021 (1 reaction)

[image]

Leozyc-waseda commented, Apr 12, 2022 (0 reactions)

But after training for a while, the following error is reported. Is there a solution?

I’m using PVRCNN, and I can train secondIoU without problems.

[2022-04-11 11:38:29,604  train.py 168  INFO]  **********************Start training da-pandaset-kitti_models/pvrcnn/pvrcnn_old_anchor(default)**********************
epochs:   0%|                       | 0/50 [00:09<?, ?it/s, loss=4.41, lr=0.001][2022-04-11 11:38:39,291  pandaset_dataset.py 234 WARNING]  The car's pitch is supposed to be negligible sin(pitch) is >= 10**-1 (0.15420941722210868)
epochs:   0%|                       | 0/50 [00:16<?, ?it/s, loss=8.99, lr=0.001][2022-04-11 11:38:46,110  pandaset_dataset.py 234 WARNING]  The car's pitch is supposed to be negligible sin(pitch) is >= 10**-1 (0.10650048163354531)
epochs:   0%|                       | 0/50 [00:21<?, ?it/s, loss=3.38, lr=0.001]
Traceback (most recent call last): 33/4880 [00:21<51:38,  1.56it/s, total_it=33]
  File "train.py", line 199, in <module>
    main()
  File "train.py", line 191, in main
    ema_model=None
  File "/home/algo-4/work/ST3D/tools/train_utils/train_utils.py", line 108, in train_model
    dataloader_iter=dataloader_iter
  File "/home/algo-4/work/ST3D/tools/train_utils/train_utils.py", line 53, in train_one_epoch
    loss, tb_dict, disp_dict = model_func(model, batch)
  File "/home/algo-4/work/ST3D/tools/../pcdet/models/__init__.py", line 31, in model_func
    ret_dict, tb_dict, disp_dict = model(batch_dict)
  File "/home/algo-4/anaconda3/envs/ST3D/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/algo-4/work/ST3D/tools/../pcdet/models/detectors/pv_rcnn.py", line 13, in forward
    batch_dict = cur_module(batch_dict)
  File "/home/algo-4/anaconda3/envs/ST3D/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/algo-4/work/ST3D/tools/../pcdet/models/backbones_3d/pfe/voxel_set_abstraction.py", line 224, in forward
    features=batch_dict['multi_scale_3d_features'][src_name].features.contiguous(),
  File "/home/algo-4/anaconda3/envs/ST3D/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/algo-4/work/ST3D/tools/../pcdet/ops/pointnet2/pointnet2_stack/pointnet2_modules.py", line 70, in forward
    xyz, xyz_batch_cnt, new_xyz, new_xyz_batch_cnt, features
  File "/home/algo-4/anaconda3/envs/ST3D/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/algo-4/work/ST3D/tools/../pcdet/ops/pointnet2/pointnet2_stack/pointnet2_utils.py", line 143, in forward
    grouped_xyz[empty_ball_mask] = 0
RuntimeError: copy_if failed to synchronize: an illegal memory access was encountered
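
CUDA kernels are launched asynchronously, so the Python line named in a traceback like the one above may only be the first place where an earlier fault was noticed. This is likely the case when the message says an operation "failed to synchronize". A common first debugging step is to set the `CUDA_LAUNCH_BLOCKING` environment variable before CUDA is initialized, which makes kernel launches synchronous so the error is raised closer to the faulting call. The sketch below assumes it sits at the very top of the training script; the helper name is hypothetical.

```python
import os


def enable_synchronous_cuda_errors() -> None:
    """Make CUDA kernel launches blocking so errors surface at the faulting call.

    This must run before the first CUDA call in the process. It slows training
    down, so it is meant for debugging runs only.
    """
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


enable_synchronous_cuda_errors()
# Import torch and start training only after this point, for example:
# import torch
```

The same effect can be had without touching the script by exporting the variable in the shell that launches training.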

Top Results From Across the Web

CUDA error 700 · Issue #482 · traveller59/spconv - GitHub
I'm trying to implement SparseConvUNet for the PartA2 model similar or practically the same code as can be found in ...

spconv - PyPI
WARNING: spconv-cu117 may require CUDA Driver >= 515. pip install spconv for CPU only (Linux Only). you should only use this for debug...

CUDA Error: Device-Side Assert Triggered: Solved | Built In
A CUDA Error: Device-Side Assert Triggered can either be caused by an inconsistency between the number of labels and output units or an ......

CUDA error encountered when constructing SpareConvTensor - CSDN Blog
RuntimeError: CUDA error: an illegal memory access was encountered. CUDA kernel errors might be asynchronously reported at some other API ...

Prerequisites — MMDetection3D 1.0.0rc4 documentation
For example, using CUDA 10.2, the command will be pip install cumm-cu102 && pip install spconv-cu102 . Supported CUDA versions include 10.2, 11.1,...
