RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1549635019666/work/torch/lib/c10d/ProcessGroupNCCL.cpp:388, invalid argument
🐛 Bug
Hi @fmassa, when I run multi-GPU training, an NCCL error appears.
Traceback (most recent call last):
  File "tools/train_net.py", line 174, in <module>
    main()
  File "tools/train_net.py", line 144, in main
    synchronize()
  File "/home/chenjoya/projects/maskrcnn-sampler/maskrcnn_benchmark/utils/comm.py", line 45, in synchronize
    dist.barrier()
  File "/home/chenjoya/opt/anaconda3/envs/maskrcnn_benchmark/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1177, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1549635019666/work/torch/lib/c10d/ProcessGroupNCCL.cpp:388, invalid argument
To Reproduce
The training script:
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 tools/train_net.py --config-file configs/e2e_faster_rcnn_R_50_FPN_1x.yaml
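When reporting or debugging a failure like this, it can help to rerun the same launch with NCCL logging enabled and to confirm that the number of visible GPUs matches `--nproc_per_node`. The following is only a diagnostic sketch, not a fix: the helper names (`build_debug_env`, `build_launch_command`, `count_visible_devices`) are hypothetical, while `NCCL_DEBUG` and `CUDA_VISIBLE_DEVICES` are the standard environment variables.

```python
import os
import sys


def build_debug_env(base_env=None):
    """Return a copy of the environment with NCCL logging turned on."""
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = "INFO"  # asks NCCL to log version and diagnostic details
    return env


def build_launch_command(nproc_per_node, config_file):
    """Rebuild the launch command from the report as an argument list."""
    return [
        sys.executable, "-m", "torch.distributed.launch",
        f"--nproc_per_node={nproc_per_node}",
        "tools/train_net.py",
        "--config-file", config_file,
    ]


def count_visible_devices(env):
    """Count the GPU ids listed in CUDA_VISIBLE_DEVICES, or None if it is unset."""
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None
    return len([part for part in value.split(",") if part.strip()])
```

Passing the results to `subprocess.run(cmd, env=env)` would rerun the training with the extra NCCL output. If `count_visible_devices` returns a number smaller than `nproc_per_node`, some processes would have no GPU of their own, which is worth ruling out before digging further.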
Environment
PyTorch 1.0.1, CUDA 9.0, NVIDIA-SMI 390.87, GCC 4.9.2-6, Linux version 3.10.0-862.14.4.el7.x86_64
Thanks for your attention!
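Version mismatches between PyTorch, CUDA, and NCCL are a frequent thing to check for NCCL errors, so it may be useful to print the versions PyTorch itself reports alongside the list above. This is a minimal sketch: `collect_versions` is a hypothetical helper, and each value falls back to `None` when PyTorch or its NCCL bindings are not available.

```python
import platform


def collect_versions():
    """Gather version facts for an environment report; None means unavailable."""
    info = {
        "python": platform.python_version(),
        "torch": None,
        "cuda": None,
        "nccl": None,
        "gpu_count": None,
    }
    try:
        import torch
    except ImportError:
        return info  # PyTorch is not installed in this interpreter
    info["torch"] = torch.__version__
    info["cuda"] = torch.version.cuda  # CUDA version PyTorch was built with
    try:
        info["gpu_count"] = torch.cuda.device_count()
        info["nccl"] = torch.cuda.nccl.version()
    except Exception:
        pass  # builds without CUDA or NCCL support leave these as None
    return info


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```

The CUDA version reported here is the one PyTorch was compiled against, which can differ from the toolkit or driver version shown by `nvidia-smi`.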
Issue Analytics
- State:
- Created 4 years ago
- Reactions:1
- Comments:7 (3 by maintainers)
Top GitHub Comments
Hi @SpongebBob, I hit the same error running on 8 GPUs and 10 CPUs. Did you resolve it by any chance?
I use 30 CPUs and 4 GPUs. I don’t know if it is the same as your issue.