Different behaviors between training with a single gpu and multiple gpus
When training SSD with PyTorch 1.2, training fails with the errors below if we use only a single GPU. However, if we train in distributed mode, everything is fine. It is a little odd that the two setups behave differently.
Here are the commands I tried:
python tools/train.py /home/ubuntu/mmdetection/configs/ssd300_coco.py
./tools/dist_train.sh /home/ubuntu/mmdetection/configs/ssd300_coco.py 8
The CUDA version is 10.1.
Here is the error message when training with a single GPU:
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THC/THCTensorScatterGather.cu:130: void THCudaTensor_scatterKernel(TensorInfo<Real, IndexType>, TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = 1]: block: [0,0,0], thread: [11,0,0] Assertion `indexValue >= 0 && indexValue < tensor.sizes[dim]` failed.
(the same assertion is repeated for threads [12,0,0] through [31,0,0])
Traceback (most recent call last):
File "tools/train.py", line 110, in <module>
main()
File "tools/train.py", line 106, in main
logger=logger)
File "/home/ubuntu/mmdetection/mmdet/apis/train.py", line 65, in train_detector
_non_dist_train(model, dataset, cfg, validate=validate)
File "/home/ubuntu/mmdetection/mmdet/apis/train.py", line 237, in _non_dist_train
runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
File "/home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/mmcv/runner/runner.py", line 363, in run
epoch_runner(data_loaders[i], **kwargs)
File "/home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/mmcv/runner/runner.py", line 274, in train
self.call_hook('after_train_iter')
File "/home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/mmcv/runner/runner.py", line 230, in call_hook
getattr(hook, fn_name)(self)
File "/home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/mmcv/runner/hooks/optimizer.py", line 17, in after_train_iter
runner.outputs['loss'].backward()
File "/home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/tensor.py", line 118, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: merge_sort: failed to synchronize: device-side assert triggered
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered (insert_events at /opt/conda/conda-bld/pytorch_1565272279342/work/c10/cuda/CUDACachingAllocator.cpp:569)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x47 (0x7f55e61e6e37 in /home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x12e14 (0x7f55e641ee14 in /home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x165bf (0x7f55e64225bf in /home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x74 (0x7f55e61d1fa4 in /home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x141ece4 (0x7f55e92a5ce4 in /home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x31b3ca0 (0x7f55eb03aca0 in /home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #6: <unknown function> + 0x3765dc2 (0x7f55eb5ecdc2 in /home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #7: torch::autograd::deleteNode(torch::autograd::Node*) + 0x7f (0x7f55eb5ece6f in /home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #8: <unknown function> + 0x3782a61 (0x7f55eb609a61 in /home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #9: c10::TensorImpl::release_resources() + 0x20 (0x7f55e61d1f50 in /home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x1ba9b4 (0x7f56172d19b4 in /home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #11: <unknown function> + 0x4000eb (0x7f56175170eb in /home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #12: <unknown function> + 0x400121 (0x7f5617517121 in /home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #28: __libc_start_main + 0xf0 (0x7f562628d830 in /lib/x86_64-linux-gnu/libc.so.6)
Aborted (core dumped)
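For what it's worth, the device-side assert fires asynchronously on the GPU, so the Python traceback (which here points at backward()) may not be where the bad index is actually produced. One way to surface the failing launch at its real call site is to rerun with synchronous kernel launches:
CUDA_LAUNCH_BLOCKING=1 python tools/train.py /home/ubuntu/mmdetection/configs/ssd300_coco.py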
Top GitHub Comments
When you are using a single GPU, you need to modify the learning rate accordingly.
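As a rough sketch of what that might look like, assuming the stock ssd300_coco.py config uses SGD with a base learning rate tuned for 8 GPUs (the exact base value is an assumption; check your own config and scale linearly with the number of GPUs):

# Hypothetical edit to configs/ssd300_coco.py.
# The base lr of 2e-3 for 8 GPUs is an assumption about the stock config;
# the idea is simply to scale it down linearly when training on 1 GPU.
optimizer = dict(type='SGD', lr=2e-3 / 8, momentum=0.9, weight_decay=5e-4)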
The same error happens to me too when using a single GPU. I use PyTorch 1.2, and I have checked the data labels many times to make sure the data is OK. Besides, sometimes training runs for a few iterations and then suddenly breaks down with the same error; it's truly weird.
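For anyone else checking their labels, a minimal sanity-check sketch against a COCO-style annotation file (the file path, class count, and background-index convention below are assumptions for illustration, not from this thread):

import json

# Hypothetical paths/values for illustration only.
ann_file = 'data/coco/annotations/instances_train2017.json'
num_classes = 80  # number of foreground classes the config expects

with open(ann_file) as f:
    coco = json.load(f)

cat_ids = sorted({c['id'] for c in coco['categories']})
print('declared categories:', len(cat_ids))

# The scatter assert means some index fell outside [0, size) along the
# scattered dim, so the remapped class labels must stay within the range the
# head expects (e.g. [0, num_classes] when index 0 is reserved for background).
label_map = {cid: i + 1 for i, cid in enumerate(cat_ids)}  # 0 = background
bad = [a for a in coco['annotations'] if a['category_id'] not in label_map]
print('annotations with unknown category_id:', len(bad))

labels = [label_map[a['category_id']]
          for a in coco['annotations'] if a['category_id'] in label_map]
print('mapped label range:', min(labels), 'to', max(labels),
      '(expected max:', num_classes, ')')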