CUDA error from spconv

Hi,

We are training the listed models on KITTI. With both of the PartA2 models, a CUDA error from spconv pops up. The other models that use spconv train fine, so we think the spconv installation is good.

  File "train.py", line 198, in <module>
    main()
  File "train.py", line 170, in main
    merge_all_iters_to_one_epoch=args.merge_all_iters_to_one_epoch
  File "/workspace/wanghuijie-data/pcdet/scripts/train_utils/train_utils.py", line 93, in train_model
    dataloader_iter=dataloader_iter
  File "/workspace/wanghuijie-data/pcdet/scripts/train_utils/train_utils.py", line 38, in train_one_epoch
    loss, tb_dict, disp_dict = model_func(model, batch)
  File "/workspace/wanghuijie-data/pcdet/pcdet/models/__init__.py", line 30, in model_func
    ret_dict, tb_dict, disp_dict = model(batch_dict)
  File "/workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/workspace/wanghuijie-data/pcdet/pcdet/models/detectors/point_rcnn.py", line 11, in forward
    batch_dict = cur_module(batch_dict)
  File "/workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/workspace/wanghuijie-data/pcdet/pcdet/models/roi_heads/partA2_head.py", line 200, in forward
    x_part = self.conv_part(part_features)
  File "/workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/workspace/.venv/point-cloud/lib/python3.6/site-packages/spconv/modules.py", line 134, in forward
    input = module(input)
  File "/workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/workspace/.venv/point-cloud/lib/python3.6/site-packages/spconv/modules.py", line 134, in forward
    input = module(input)
  File "/workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/workspace/.venv/point-cloud/lib/python3.6/site-packages/spconv/conv.py", line 181, in forward
    use_hash=self.use_hash)
  File "/workspace/.venv/point-cloud/lib/python3.6/site-packages/spconv/ops.py", line 95, in get_indice_pairs
    int(use_hash))
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:733 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fd430ec42f2 in /workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7fd430ec167b in /workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x809 (0x7fd4392e41f9 in /workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7fd430eac3a4 in /workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #4: std::vector<c10d::Reducer::Bucket, std::allocator<c10d::Reducer::Bucket> >::~vector() + 0x2f9 (0x7fd452338d19 in /workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #5: c10d::Reducer::~Reducer() + 0x26a (0x7fd45232db5a in /workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #6: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7fd452355b92 in /workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #7: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x46 (0x7fd451c70056 in /workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0xa43caf (0x7fd452358caf in /workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x36a4a8 (0x7fd451c7f4a8 in /workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x36b7ae (0x7fd451c807ae in /workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #11: <unknown function> + 0xf2828 (0x5643e76d1828 in /workspace/.venv/point-cloud/bin/python)
frame #12: <unknown function> + 0x19aa90 (0x5643e7779a90 in /workspace/.venv/point-cloud/bin/python)
frame #13: <unknown function> + 0xf2247 (0x5643e76d1247 in /workspace/.venv/point-cloud/bin/python)
frame #14: <unknown function> + 0xf20d7 (0x5643e76d10d7 in /workspace/.venv/point-cloud/bin/python)
frame #15: <unknown function> + 0xf20ed (0x5643e76d10ed in /workspace/.venv/point-cloud/bin/python)
frame #16: PyDict_SetItem + 0x3da (0x5643e7717d7a in /workspace/.venv/point-cloud/bin/python)
frame #17: PyDict_SetItemString + 0x4f (0x5643e771ec5f in /workspace/.venv/point-cloud/bin/python)
frame #18: PyImport_Cleanup + 0x99 (0x5643e7783dc9 in /workspace/.venv/point-cloud/bin/python)
frame #19: Py_FinalizeEx + 0x61 (0x5643e77ee961 in /workspace/.venv/point-cloud/bin/python)
frame #20: Py_Main + 0x35e (0x5643e77f8cae in /workspace/.venv/point-cloud/bin/python)
frame #21: main + 0xee (0x5643e76c2f2e in /workspace/.venv/point-cloud/bin/python)
frame #22: __libc_start_main + 0xe7 (0x7fd457450b97 in /lib/x86_64-linux-gnu/libc.so.6)
frame #23: <unknown function> + 0x1c327f (0x5643e77a227f in /workspace/.venv/point-cloud/bin/python)
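
The Python part of the trace above ends in spconv's get_indice_pairs, which works on the sparse tensor's index rows. One commonly reported cause of an illegal memory access at that point is an index row that lies outside the declared grid; whether that is what happens here is not confirmed by this issue. Below is a minimal sketch of a bounds check, assuming each row is (batch_idx, z, y, x) and the spatial shape is given as (z, y, x). The function name and the example values are hypothetical.

```python
from typing import Iterable, List, Sequence, Tuple


def find_invalid_indices(
    indices: Iterable[Sequence[int]],
    spatial_shape: Sequence[int],
    batch_size: int,
) -> List[Tuple[int, Tuple[int, ...]]]:
    """Return (row_number, row) for every index row that falls outside the grid.

    Assumption: each row is (batch_idx, z, y, x) and spatial_shape is (z, y, x).
    """
    bad = []
    for row_number, row in enumerate(indices):
        batch_idx, coords = row[0], tuple(row[1:])
        # The batch index must refer to an existing sample.
        in_batch = 0 <= batch_idx < batch_size
        # Every spatial coordinate must lie in [0, size) along its axis.
        in_grid = len(coords) == len(spatial_shape) and all(
            0 <= c < size for c, size in zip(coords, spatial_shape)
        )
        if not (in_batch and in_grid):
            bad.append((row_number, tuple(row)))
    return bad


# Hypothetical example: a 2-sample batch on a 41 x 1600 x 1408 grid.
example = [
    (0, 10, 200, 300),  # valid
    (1, 41, 0, 0),      # z == 41 is one past the last valid z index (40)
    (2, 0, 0, 0),       # batch index 2 does not exist when batch_size == 2
]
print(find_invalid_indices(example, spatial_shape=(41, 1600, 1408), batch_size=2))
```

With a PyTorch index tensor, the rows can be obtained with `indices.cpu().tolist()` just before the sparse convolution is called. An empty result rules out this particular cause.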

Issue Analytics

Created 2 years ago
Comments: 6 (1 by maintainers)

Top GitHub Comments

KangChou commented, Jun 24, 2021 (1 reaction)

Leozyc-waseda commented, Apr 12, 2022 (0 reactions)

But after training for a while, the following error is reported. Is there a solution, please?

I'm using PV-RCNN, and I can train SECOND-IoU without any problems.

[2022-04-11 11:38:29,604  train.py 168  INFO]  **********************Start training da-pandaset-kitti_models/pvrcnn/pvrcnn_old_anchor(default)**********************
epochs:   0%|                       | 0/50 [00:09<?, ?it/s, loss=4.41, lr=0.001][2022-04-11 11:38:39,291  pandaset_dataset.py 234 WARNING]  The car's pitch is supposed to be negligible sin(pitch) is >= 10**-1 (0.15420941722210868)
epochs:   0%|                       | 0/50 [00:16<?, ?it/s, loss=8.99, lr=0.001][2022-04-11 11:38:46,110  pandaset_dataset.py 234 WARNING]  The car's pitch is supposed to be negligible sin(pitch) is >= 10**-1 (0.10650048163354531)
epochs:   0%|                       | 0/50 [00:21<?, ?it/s, loss=3.38, lr=0.001]
 33/4880 [00:21<51:38,  1.56it/s, total_it=33]
Traceback (most recent call last):
  File "train.py", line 199, in <module>
    main()
  File "train.py", line 191, in main
    ema_model=None
  File "/home/algo-4/work/ST3D/tools/train_utils/train_utils.py", line 108, in train_model
    dataloader_iter=dataloader_iter
  File "/home/algo-4/work/ST3D/tools/train_utils/train_utils.py", line 53, in train_one_epoch
    loss, tb_dict, disp_dict = model_func(model, batch)
  File "/home/algo-4/work/ST3D/tools/../pcdet/models/__init__.py", line 31, in model_func
    ret_dict, tb_dict, disp_dict = model(batch_dict)
  File "/home/algo-4/anaconda3/envs/ST3D/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/algo-4/work/ST3D/tools/../pcdet/models/detectors/pv_rcnn.py", line 13, in forward
    batch_dict = cur_module(batch_dict)
  File "/home/algo-4/anaconda3/envs/ST3D/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/algo-4/work/ST3D/tools/../pcdet/models/backbones_3d/pfe/voxel_set_abstraction.py", line 224, in forward
    features=batch_dict['multi_scale_3d_features'][src_name].features.contiguous(),
  File "/home/algo-4/anaconda3/envs/ST3D/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/algo-4/work/ST3D/tools/../pcdet/ops/pointnet2/pointnet2_stack/pointnet2_modules.py", line 70, in forward
    xyz, xyz_batch_cnt, new_xyz, new_xyz_batch_cnt, features
  File "/home/algo-4/anaconda3/envs/ST3D/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/algo-4/work/ST3D/tools/../pcdet/ops/pointnet2/pointnet2_stack/pointnet2_utils.py", line 143, in forward
    grouped_xyz[empty_ball_mask] = 0
RuntimeError: copy_if failed to synchronize: an illegal memory access was encountered
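
CUDA kernels are launched asynchronously, so the Python line named in a trace like the one above (here the masked assignment in pointnet2_utils.py) may only be where an earlier fault was first noticed. Setting the CUDA_LAUNCH_BLOCKING environment variable to 1 makes kernel launches synchronous, so the error is reported at the call that caused it. The sketch below assumes it runs at the very top of the training script, before anything initialises CUDA; the helper name is hypothetical.

```python
import os


def enable_synchronous_cuda_errors() -> str:
    """Set CUDA_LAUNCH_BLOCKING=1 for this process and return the value."""
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    return os.environ["CUDA_LAUNCH_BLOCKING"]


# Call this before the first CUDA call in the process; the variable is read
# when CUDA is initialised, so setting it later has no effect.
enable_synchronous_cuda_errors()
print("CUDA_LAUNCH_BLOCKING =", os.environ["CUDA_LAUNCH_BLOCKING"])
```

The same effect is available from the shell by prefixing the launch command with `CUDA_LAUNCH_BLOCKING=1`. Training runs noticeably slower in this mode, so it is meant for locating the failing call rather than for regular runs.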