Different behaviors between training with a single gpu and multiple gpus
When training SSD with PyTorch 1.2, training fails with the errors below if we use only a single GPU. However, if we train in distributed mode, everything is fine. It is a little odd that the two setups behave differently.
Here are the commands I tried:
python tools/train.py /home/ubuntu/mmdetection/configs/ssd300_coco.py
./tools/dist_train.sh /home/ubuntu/mmdetection/configs/ssd300_coco.py 8
The CUDA version is 10.1.
Here is the error message when training with a single GPU:
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THC/THCTensorScatterGather.cu:130: void THCudaTensor_scatterKernel(TensorInfo<Real, IndexType>, TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = 1]: block: [0,0,0], thread: [11,0,0] Assertion `indexValue >= 0 && indexValue < tensor.sizes[dim]` failed.
(the same assertion is repeated for threads [12,0,0] through [31,0,0])
Traceback (most recent call last):
File "tools/train.py", line 110, in <module>
main()
File "tools/train.py", line 106, in main
logger=logger)
File "/home/ubuntu/mmdetection/mmdet/apis/train.py", line 65, in train_detector
_non_dist_train(model, dataset, cfg, validate=validate)
File "/home/ubuntu/mmdetection/mmdet/apis/train.py", line 237, in _non_dist_train
runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
File "/home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/mmcv/runner/runner.py", line 363, in run
epoch_runner(data_loaders[i], **kwargs)
File "/home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/mmcv/runner/runner.py", line 274, in train
self.call_hook('after_train_iter')
File "/home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/mmcv/runner/runner.py", line 230, in call_hook
getattr(hook, fn_name)(self)
File "/home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/mmcv/runner/hooks/optimizer.py", line 17, in after_train_iter
runner.outputs['loss'].backward()
File "/home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/tensor.py", line 118, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: merge_sort: failed to synchronize: device-side assert triggered
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered (insert_events at /opt/conda/conda-bld/pytorch_1565272279342/work/c10/cuda/CUDACachingAllocator.cpp:569)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x47 (0x7f55e61e6e37 in /home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x12e14 (0x7f55e641ee14 in /home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x165bf (0x7f55e64225bf in /home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x74 (0x7f55e61d1fa4 in /home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x141ece4 (0x7f55e92a5ce4 in /home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x31b3ca0 (0x7f55eb03aca0 in /home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #6: <unknown function> + 0x3765dc2 (0x7f55eb5ecdc2 in /home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #7: torch::autograd::deleteNode(torch::autograd::Node*) + 0x7f (0x7f55eb5ece6f in /home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #8: <unknown function> + 0x3782a61 (0x7f55eb609a61 in /home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #9: c10::TensorImpl::release_resources() + 0x20 (0x7f55e61d1f50 in /home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x1ba9b4 (0x7f56172d19b4 in /home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #11: <unknown function> + 0x4000eb (0x7f56175170eb in /home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #12: <unknown function> + 0x400121 (0x7f5617517121 in /home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #28: __libc_start_main + 0xf0 (0x7f562628d830 in /lib/x86_64-linux-gnu/libc.so.6)
Aborted (core dumped)
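For what it's worth, the device-side assert fires asynchronously on the GPU, so the Python traceback (which here points at backward()) may not be where the bad index is actually produced. One way to surface the failing launch at its real call site is to rerun with synchronous kernel launches:
CUDA_LAUNCH_BLOCKING=1 python tools/train.py /home/ubuntu/mmdetection/configs/ssd300_coco.py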
Top GitHub Comments
When you are using a single GPU, you need to modify the learning rate accordingly.
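As a rough sketch of what that might look like, assuming the stock ssd300_coco.py config uses SGD with a base learning rate tuned for 8 GPUs (the exact base value is an assumption; check your own config and scale linearly with the number of GPUs):

# Hypothetical edit to configs/ssd300_coco.py.
# The base lr of 2e-3 for 8 GPUs is an assumption about the stock config;
# the idea is simply to scale it down linearly when training on 1 GPU.
optimizer = dict(type='SGD', lr=2e-3 / 8, momentum=0.9, weight_decay=5e-4)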
The same error happens to me too when using a single GPU. I use PyTorch 1.2, and I have checked the data labels many times to make sure the data is OK. Besides, sometimes training runs for a few iterations and then suddenly breaks down with the same error; it's truly weird.
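For anyone else checking their labels, a minimal sanity-check sketch against a COCO-style annotation file (the file path, class count, and background-index convention below are assumptions for illustration, not from this thread):

import json

# Hypothetical paths/values for illustration only.
ann_file = 'data/coco/annotations/instances_train2017.json'
num_classes = 80  # number of foreground classes the config expects

with open(ann_file) as f:
    coco = json.load(f)

cat_ids = sorted({c['id'] for c in coco['categories']})
print('declared categories:', len(cat_ids))

# The scatter assert means some index fell outside [0, size) along the
# scattered dim, so the remapped class labels must stay within the range the
# head expects (e.g. [0, num_classes] when index 0 is reserved for background).
label_map = {cid: i + 1 for i, cid in enumerate(cat_ids)}  # 0 = background
bad = [a for a in coco['annotations'] if a['category_id'] not in label_map]
print('annotations with unknown category_id:', len(bad))

labels = [label_map[a['category_id']]
          for a in coco['annotations'] if a['category_id'] in label_map]
print('mapped label range:', min(labels), 'to', max(labels),
      '(expected max:', num_classes, ')')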