
Different behaviors between training with a single GPU and multiple GPUs

See original GitHub issue

When training SSD with PyTorch 1.2, errors occur if we use only a single GPU. However, if we train in distributed mode, everything is fine. It is a little strange that the two modes behave differently.

Here are the commands I tried:

python tools/train.py /home/ubuntu/mmdetection/configs/ssd300_coco.py

./tools/dist_train.sh /home/ubuntu/mmdetection/configs/ssd300_coco.py 8

CUDA version is 10.1
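
As a general debugging aid (not something mentioned in the original report), device-side asserts like the one below are easier to localize if CUDA kernel launches are made synchronous, so the failure is raised at the offending call instead of at a later synchronization point:

CUDA_LAUNCH_BLOCKING=1 python tools/train.py /home/ubuntu/mmdetection/configs/ssd300_coco.py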

Here is the error message when training with a single GPU:

/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THC/THCTensorScatterGather.cu:130: void THCudaTensor_scatterKernel(TensorInfo<Real, IndexType>, TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = 1]: block: [0,0,0], thread: [12,0,0] Assertion `indexValue >= 0 && indexValue < tensor.sizes[dim]` failed.
[... the same THCudaTensor_scatterKernel assertion is repeated for many more threads in the same block ...]
Traceback (most recent call last):
  File "tools/train.py", line 110, in <module>
    main()
  File "tools/train.py", line 106, in main
    logger=logger)
  File "/home/ubuntu/mmdetection/mmdet/apis/train.py", line 65, in train_detector
    _non_dist_train(model, dataset, cfg, validate=validate)
  File "/home/ubuntu/mmdetection/mmdet/apis/train.py", line 237, in _non_dist_train
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/mmcv/runner/runner.py", line 363, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/mmcv/runner/runner.py", line 274, in train
    self.call_hook('after_train_iter')
  File "/home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/mmcv/runner/runner.py", line 230, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/mmcv/runner/hooks/optimizer.py", line 17, in after_train_iter
    runner.outputs['loss'].backward()
  File "/home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/tensor.py", line 118, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: merge_sort: failed to synchronize: device-side assert triggered
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: device-side assert triggered (insert_events at /opt/conda/conda-bld/pytorch_1565272279342/work/c10/cuda/CUDACachingAllocator.cpp:569)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x47 (0x7f55e61e6e37 in /home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x12e14 (0x7f55e641ee14 in /home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x165bf (0x7f55e64225bf in /home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x74 (0x7f55e61d1fa4 in /home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x141ece4 (0x7f55e92a5ce4 in /home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x31b3ca0 (0x7f55eb03aca0 in /home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #6: <unknown function> + 0x3765dc2 (0x7f55eb5ecdc2 in /home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #7: torch::autograd::deleteNode(torch::autograd::Node*) + 0x7f (0x7f55eb5ece6f in /home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #8: <unknown function> + 0x3782a61 (0x7f55eb609a61 in /home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #9: c10::TensorImpl::release_resources() + 0x20 (0x7f55e61d1f50 in /home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x1ba9b4 (0x7f56172d19b4 in /home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #11: <unknown function> + 0x4000eb (0x7f56175170eb in /home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #12: <unknown function> + 0x400121 (0x7f5617517121 in /home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #28: __libc_start_main + 0xf0 (0x7f562628d830 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

3 reactions
hellock commented on Dec 23, 2019

When you are using a single GPU, you need to modify the learning rate accordingly.
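
For context (this snippet is not from the issue; it is a minimal sketch assuming the stock ssd300_coco.py optimizer settings were tuned for 8 GPUs, and the values shown are illustrative rather than copied from the config), the usual convention in mmdetection is the linear scaling rule: scale the learning rate in proportion to the number of GPUs, i.e. the total batch size. Training on a single GPU with the same per-GPU batch size therefore means dividing the default learning rate by 8:

# configs/ssd300_coco.py (illustrative values; check your own config for the real defaults)
# assumed baseline, tuned for 8 GPUs:
optimizer = dict(type='SGD', lr=2e-3, momentum=0.9, weight_decay=5e-4)

# single-GPU training with the same per-GPU batch size -> apply the linear scaling rule:
optimizer = dict(type='SGD', lr=2e-3 / 8, momentum=0.9, weight_decay=5e-4)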

2 reactions
rangerToby commented on Dec 23, 2019

The same error happens to me too when using a single GPU! I use PyTorch 1.2, and I have checked the data labels many times to make sure the data is fine. Besides, sometimes training runs for a few iterations and then suddenly breaks down with the same error. It's truly weird.
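
Since the failing assertion is an out-of-range index in a scatter kernel, one plausible culprit is a category label that falls outside the range the model expects. The snippet below is not from the issue; it is a minimal sketch, assuming a COCO-format annotation file at a hypothetical path and a hypothetical num_classes value, for checking that every annotation's category id maps into the valid label range:

import json

ann_file = 'data/coco/annotations/instances_train2017.json'  # hypothetical path
num_classes = 80  # hypothetical: should match the number of foreground classes in the config

with open(ann_file) as f:
    coco = json.load(f)

# Contiguous label mapping, as a COCO-style dataset loader would typically build it
cat_ids = sorted(c['id'] for c in coco['categories'])
cat2label = {cid: i for i, cid in enumerate(cat_ids)}

# Collect annotations whose category id is unknown or maps outside [0, num_classes)
bad = [a for a in coco['annotations']
       if a['category_id'] not in cat2label
       or not 0 <= cat2label[a['category_id']] < num_classes]
print(len(bad), 'annotations with out-of-range labels')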

Read more comments on GitHub >

Top Results From Across the Web

Efficient Training on Multiple GPUs - Hugging Face
Switching from a single GPU to multiple requires some form of parallelism as the work needs to be distributed. There are several techniques...
Read more >
Multi-GPU and distributed training - Keras
This guide focuses on data parallelism, in particular synchronous data parallelism, where the different replicas of the model stay in sync after ...
Read more >
How to scale training on multiple GPUs - Towards Data Science
I will cover the main differences between the two, and how training in multiple GPUs works. I will first explain how the training...
Read more >
Multi-GPU and Distributed Deep Learning - frankdenneman.nl
With model parallelism, a single model (Neural Network A) is split and distributed across different GPUs (GPU0 and GPU1). The same (full) ...
Read more >
The importance of hyperparameter tuning for scaling deep ...
When moving from training on a single GPU to training on multiple GPUs, a good heuristic is to increase the mini-batch size by...
Read more >
