RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1549635019666/work/torch/lib/c10d/ProcessGroupNCCL.cpp:388, invalid argument

🐛 Bug

Hi @fmassa, when I run multi-GPU training, I get the following NCCL error.

  File "tools/train_net.py", line 174, in <module>
    main()
  File "tools/train_net.py", line 144, in main
    synchronize()
  File "/home/chenjoya/projects/maskrcnn-sampler/maskrcnn_benchmark/utils/comm.py", line 45, in synchronize
    dist.barrier()
  File "/home/chenjoya/opt/anaconda3/envs/maskrcnn_benchmark/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1177, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1549635019666/work/torch/lib/c10d/ProcessGroupNCCL.cpp:388, invalid argument
Traceback (most recent call last):
  File "tools/train_net.py", line 174, in <module>
    main()
  File "tools/train_net.py", line 144, in main
    synchronize()
  File "/home/chenjoya/projects/maskrcnn-sampler/maskrcnn_benchmark/utils/comm.py", line 45, in synchronize
    dist.barrier()
  File "/home/chenjoya/opt/anaconda3/envs/maskrcnn_benchmark/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1177, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1549635019666/work/torch/lib/c10d/ProcessGroupNCCL.cpp:388, invalid argument

To Reproduce

The training script:

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 tools/train_net.py --config-file configs/e2e_faster_rcnn_R_50_FPN_1x.yaml
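
For extra diagnostics, the same launch can be assembled from a short Python script that also turns on NCCL's own logging. This is only a sketch and not part of the original report: `build_launch` is a hypothetical helper, and setting `NCCL_DEBUG=INFO` is an added assumption about what might be worth inspecting for this error.

```python
import os
import shlex
import sys


def build_launch(gpu_ids, config_file, script="tools/train_net.py"):
    """Build the command and environment for one training process per GPU.

    build_launch is a hypothetical helper, not part of maskrcnn_benchmark.
    """
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpu_ids)
    # Assumption: NCCL_DEBUG=INFO asks NCCL to log initialization and error details.
    env["NCCL_DEBUG"] = "INFO"
    cmd = [
        sys.executable, "-m", "torch.distributed.launch",
        # Keep the process count equal to the number of visible GPUs.
        f"--nproc_per_node={len(gpu_ids)}",
        script,
        "--config-file", config_file,
    ]
    return cmd, env


if __name__ == "__main__":
    cmd, env = build_launch([0, 1, 2, 3], "configs/e2e_faster_rcnn_R_50_FPN_1x.yaml")
    # Print the command instead of running it, so the sketch has no side effects.
    print(shlex.join(cmd))
```

To actually start training, the returned pair could be passed to `subprocess.run(cmd, env=env, check=True)`.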

Environment

PyTorch 1.0.1
CUDA 9.0
NVIDIA-SMI 390.87
GCC 4.9.2-6
Linux version 3.10.0-862.14.4.el7.x86_64

Thanks for your attention! ^ ^

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Reactions: 1
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

1 reaction
florence27 commented, Jul 4, 2019

Hi @SpongebBob, I hit the same error running on 8 GPUs and 10 CPUs. Did you resolve it by any chance?

0 reactions
SpongebBob commented, Jun 1, 2019

Hi @SpongebBob @djiajunustc, thanks for your attention to this problem.

The problem is caused by insufficient CPU resources (e.g. the maximum number of threads, or RAM).

For me, it worked after I raised the limit on the number of threads.

I use 30 CPUs and 4 GPUs. I don't know if this is the same as your issue.

Top Results From Across the Web

NCCL Error 4: invalid argument · Issue #31045 - GitHub
Bug After upgrading to CUDA 10.2 (10.2, V10.2.89), and nccl-2.5.6-1 (PyTorch 1.3.1), I have the following error when using DataParallel: ...
How to solve the famous `unhandled cuda error, NCCL ...
I had the right cuda installed meaning: python -c "import torch;print(torch.version.cuda)" #was equal to nvcc -V.
NCCL error in: /opt/conda/conda-bld/pytorch_1614378083779 ...
_broadcast_coalesced( RuntimeError: NCCL error in: ... invalid usage, NCCL version 2.10.3 ncclInvalidUsage: This usually reflects invalid ...
Error in running test file - Google Groups
RuntimeError : scan failed on 2nd step: cudaErrorInvalidValue: invalid argument". It seems like my system doesn't recognize cuda package.
Installation Guide - NCCL - NVIDIA Documentation Center
Error codes have been merged into the ncclInvalidArgument category and have been simplified. A new ncclInvalidUsage code has been created to cover new ......
