RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1549635019666/work/torch/lib/c10d/ProcessGroupNCCL.cpp:388, invalid argument

🐛 Bug

Hi @fmassa, when I run multi-GPU training, I get the following NCCL error:

Traceback (most recent call last):
  File "tools/train_net.py", line 174, in <module>
    main()
  File "tools/train_net.py", line 144, in main
    synchronize()
  File "/home/chenjoya/projects/maskrcnn-sampler/maskrcnn_benchmark/utils/comm.py", line 45, in synchronize
    dist.barrier()
  File "/home/chenjoya/opt/anaconda3/envs/maskrcnn_benchmark/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1177, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1549635019666/work/torch/lib/c10d/ProcessGroupNCCL.cpp:388, invalid argument

To Reproduce

The training script:

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 tools/train_net.py --config-file configs/e2e_faster_rcnn_R_50_FPN_1x.yaml

Environment

PyTorch 1.0.1
CUDA 9.0
NVIDIA-SMI 390.87
GCC 4.9.2-6
Linux version 3.10.0-862.14.4.el7.x86_64

Thanks for your attention!

Top GitHub Comments

florence27 commented, Jul 4, 2019 (1 reaction)

Hi @SpongebBob, I hit the same error running on 8 GPUs and 10 CPUs. Did you resolve it by any chance?

SpongebBob commented, Jun 1, 2019 (0 reactions)

Hi @SpongebBob @djiajunustc, thanks for your attention to this problem.

The problem is caused by insufficient CPU resources (e.g. the maximum number of threads, or RAM).

For me, it worked after raising the limit on the number of threads.

I use 30 CPUs and 4 GPUs. I don't know whether this is the same as your issue.
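
The comment above does not say which thread limit was raised. As a rough sketch, assuming the relevant limit is the per-user process/thread limit (`RLIMIT_NPROC`, the one `ulimit -u` reports on Linux), the following standard-library script shows the current limits and the CPUs visible to the process. The helper names `format_limit`, `report_cpu_limits` and `raise_nproc_soft_limit` are made up for illustration and are not part of PyTorch or maskrcnn_benchmark.

```python
import os
import resource


def format_limit(value):
    """Render an rlimit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def report_cpu_limits():
    """Collect the per-user process/thread limit and the CPUs usable by this process."""
    # Assumption: RLIMIT_NPROC is the limit in question; the thread does not confirm this.
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    # sched_getaffinity is Linux-only; fall back to the machine-wide count elsewhere.
    if hasattr(os, "sched_getaffinity"):
        usable_cpus = len(os.sched_getaffinity(0))
    else:
        usable_cpus = os.cpu_count() or 1
    return {
        "nproc_soft": format_limit(soft),
        "nproc_hard": format_limit(hard),
        "usable_cpus": usable_cpus,
    }


def raise_nproc_soft_limit():
    """Raise the soft RLIMIT_NPROC to the hard limit for this process and its children."""
    _, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    resource.setrlimit(resource.RLIMIT_NPROC, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_NPROC)


if __name__ == "__main__":
    print(report_cpu_limits())
```

If the soft limit turns out to be low compared with the number of worker processes and data-loading threads being launched, it may be worth raising it (or asking the cluster administrator to raise the hard limit) before retrying. Whether that resolves this particular NCCL error depends on the setup.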
