CUDA error from spconv
Hi,
We are training the listed models on KITTI. For both of the PartA2 models, a CUDA error from spconv pops up. The other models that use spconv train fine, so we think the spconv installation is good.
  File "train.py", line 198, in <module>
    main()
  File "train.py", line 170, in main
    merge_all_iters_to_one_epoch=args.merge_all_iters_to_one_epoch
  File "/workspace/wanghuijie-data/pcdet/scripts/train_utils/train_utils.py", line 93, in train_model
    dataloader_iter=dataloader_iter
  File "/workspace/wanghuijie-data/pcdet/scripts/train_utils/train_utils.py", line 38, in train_one_epoch
    loss, tb_dict, disp_dict = model_func(model, batch)
  File "/workspace/wanghuijie-data/pcdet/pcdet/models/__init__.py", line 30, in model_func
    ret_dict, tb_dict, disp_dict = model(batch_dict)
  File "/workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/workspace/wanghuijie-data/pcdet/pcdet/models/detectors/point_rcnn.py", line 11, in forward
    batch_dict = cur_module(batch_dict)
  File "/workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/workspace/wanghuijie-data/pcdet/pcdet/models/roi_heads/partA2_head.py", line 200, in forward
    x_part = self.conv_part(part_features)
  File "/workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/workspace/.venv/point-cloud/lib/python3.6/site-packages/spconv/modules.py", line 134, in forward
    input = module(input)
  File "/workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/workspace/.venv/point-cloud/lib/python3.6/site-packages/spconv/modules.py", line 134, in forward
    input = module(input)
  File "/workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/workspace/.venv/point-cloud/lib/python3.6/site-packages/spconv/conv.py", line 181, in forward
    use_hash=self.use_hash)
  File "/workspace/.venv/point-cloud/lib/python3.6/site-packages/spconv/ops.py", line 95, in get_indice_pairs
    int(use_hash))
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:733 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fd430ec42f2 in /workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7fd430ec167b in /workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x809 (0x7fd4392e41f9 in /workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7fd430eac3a4 in /workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #4: std::vector<c10d::Reducer::Bucket, std::allocator<c10d::Reducer::Bucket> >::~vector() + 0x2f9 (0x7fd452338d19 in /workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #5: c10d::Reducer::~Reducer() + 0x26a (0x7fd45232db5a in /workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #6: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7fd452355b92 in /workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #7: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x46 (0x7fd451c70056 in /workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0xa43caf (0x7fd452358caf in /workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x36a4a8 (0x7fd451c7f4a8 in /workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x36b7ae (0x7fd451c807ae in /workspace/.venv/point-cloud/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #11: <unknown function> + 0xf2828 (0x5643e76d1828 in /workspace/.venv/point-cloud/bin/python)
frame #12: <unknown function> + 0x19aa90 (0x5643e7779a90 in /workspace/.venv/point-cloud/bin/python)
frame #13: <unknown function> + 0xf2247 (0x5643e76d1247 in /workspace/.venv/point-cloud/bin/python)
frame #14: <unknown function> + 0xf20d7 (0x5643e76d10d7 in /workspace/.venv/point-cloud/bin/python)
frame #15: <unknown function> + 0xf20ed (0x5643e76d10ed in /workspace/.venv/point-cloud/bin/python)
frame #16: PyDict_SetItem + 0x3da (0x5643e7717d7a in /workspace/.venv/point-cloud/bin/python)
frame #17: PyDict_SetItemString + 0x4f (0x5643e771ec5f in /workspace/.venv/point-cloud/bin/python)
frame #18: PyImport_Cleanup + 0x99 (0x5643e7783dc9 in /workspace/.venv/point-cloud/bin/python)
frame #19: Py_FinalizeEx + 0x61 (0x5643e77ee961 in /workspace/.venv/point-cloud/bin/python)
frame #20: Py_Main + 0x35e (0x5643e77f8cae in /workspace/.venv/point-cloud/bin/python)
frame #21: main + 0xee (0x5643e76c2f2e in /workspace/.venv/point-cloud/bin/python)
frame #22: __libc_start_main + 0xe7 (0x7fd457450b97 in /lib/x86_64-linux-gnu/libc.so.6)
frame #23: <unknown function> + 0x1c327f (0x5643e77a227f in /workspace/.venv/point-cloud/bin/python)
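
A possible way to narrow this down (a debugging sketch, not a confirmed fix): CUDA reports kernel errors asynchronously, so the Python frame in the traceback may not be the call that actually faulted. Setting `CUDA_LAUNCH_BLOCKING=1` before CUDA is initialized makes launches synchronous. One thing worth checking, as an assumption rather than a diagnosis, is whether the indices going into `self.conv_part` fall outside the sparse grid. The helper below is hypothetical (`find_bad_indices` is not part of spconv or pcdet) and assumes each index row is `(batch_idx, z, y, x)` with `spatial_shape` given as `(z, y, x)`.

```python
import os

# Make CUDA kernel launches synchronous so the Python traceback points at the
# kernel that actually faulted. This must be set before CUDA is initialized.
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")


def find_bad_indices(indices, spatial_shape, batch_size):
    """Return (row_number, reason) pairs for index rows outside the grid.

    Hypothetical helper. Assumes each row is (batch_idx, z, y, x) and that
    spatial_shape is (z, y, x); adjust if your layout differs.
    """
    bad = []
    for row_no, row in enumerate(indices):
        batch_idx, coords = row[0], row[1:]
        if len(coords) != len(spatial_shape):
            bad.append((row_no, "wrong number of coordinates"))
        elif not 0 <= batch_idx < batch_size:
            bad.append((row_no, "batch index out of range"))
        elif any(not 0 <= c < s for c, s in zip(coords, spatial_shape)):
            bad.append((row_no, "coordinate outside spatial_shape"))
    return bad
```

With a sparse tensor `t`, and assuming it exposes `indices`, `spatial_shape` and `batch_size` attributes, this could be called as `find_bad_indices(t.indices.cpu().tolist(), t.spatial_shape, t.batch_size)` just before `self.conv_part(part_features)`. An empty result would rule out out-of-range indices as the cause.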
Issue Analytics
- Created 2 years ago
- Comments:6 (1 by maintainers)

But after training for a while, the following error is reported. Is there a solution, please?
I’m using PVRCNN, and I can train secondIoU very well.