
Distributed training error on 2 nodes each with 8 or 4 GPUs

See original GitHub issue

When I run distributed training on 2 nodes, each with 8 or 4 GPUs, the worker node terminates with the following errors:

```
pytorch-fairseq-worker-0:16:63 [0] NCCL INFO Call to connect returned Connection refused, retrying
pytorch-fairseq-worker-0:18:64 [2] NCCL INFO Call to connect returned Connection refused, retrying
pytorch-fairseq-worker-0:16:63 [0] NCCL INFO Call to connect returned Connection refused, retrying
pytorch-fairseq-worker-0:18:64 [2] NCCL INFO Call to connect returned Connection refused, retrying
pytorch-fairseq-worker-0:16:63 [0] NCCL INFO Call to connect returned Connection refused, retrying
pytorch-fairseq-worker-0:18:64 [2] NCCL INFO Call to connect returned Connection refused, retrying
pytorch-fairseq-worker-0:16:63 [0] NCCL INFO Call to connect returned Connection refused, retrying
pytorch-fairseq-worker-0:18:64 [2] NCCL INFO Call to connect returned Connection refused, retrying
pytorch-fairseq-worker-0:16:63 [0] NCCL INFO Call to connect returned Connection refused, retrying
pytorch-fairseq-worker-0:16:63 [0] NCCL INFO Call to connect returned Connection refused, retrying
pytorch-fairseq-worker-0:18:64 [2] NCCL INFO Call to connect returned Connection refused, retrying
pytorch-fairseq-worker-0:18:64 [2] NCCL INFO Call to connect returned Connection refused, retrying
pytorch-fairseq-worker-0:16:63 [0] NCCL INFO Call to connect returned Connection refused, retrying

pytorch-fairseq-worker-0:18:64 [2] include/socket.h:382 NCCL WARN Connect to 172.31.1.37<58944> failed : Connection refused
pytorch-fairseq-worker-0:18:64 [2] NCCL INFO transport/net_socket.cu:142 -> 2
pytorch-fairseq-worker-0:16:63 [0] NCCL INFO Call to connect returned Connection refused, retrying
pytorch-fairseq-worker-0:18:64 [2] NCCL INFO bootstrap.cu:19 -> 2
pytorch-fairseq-worker-0:18:64 [2] NCCL INFO bootstrap.cu:218 -> 2
pytorch-fairseq-worker-0:18:64 [2] NCCL INFO init.cu:683 -> 2
pytorch-fairseq-worker-0:18:64 [2] NCCL INFO init.cu:915 -> 2
pytorch-fairseq-worker-0:18:64 [2] NCCL INFO misc/group.cu:69 -> 2 [Async thread]

pytorch-fairseq-worker-0:16:63 [0] include/socket.h:382 NCCL WARN Connect to 172.31.1.37<58944> failed : Connection refused
pytorch-fairseq-worker-0:16:63 [0] NCCL INFO transport/net_socket.cu:142 -> 2
pytorch-fairseq-worker-0:16:63 [0] NCCL INFO bootstrap.cu:19 -> 2
pytorch-fairseq-worker-0:16:63 [0] NCCL INFO bootstrap.cu:218 -> 2
pytorch-fairseq-worker-0:16:63 [0] NCCL INFO init.cu:683 -> 2
pytorch-fairseq-worker-0:16:63 [0] NCCL INFO init.cu:915 -> 2
pytorch-fairseq-worker-0:16:63 [0] NCCL INFO misc/group.cu:69 -> 2 [Async thread]

Traceback (most recent call last):
  File "/opt/conda/bin/fairseq-train", line 10, in <module>
    sys.exit(cli_main())
  File "/opt/conda/lib/python3.6/site-packages/fairseq_cli/train.py", line 302, in cli_main
    nprocs=torch.cuda.device_count(),
  File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 167, in spawn
    while not spawn_context.join():
  File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 114, in join
    raise Exception(msg)
Exception:

-- Process 4 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/opt/conda/lib/python3.6/site-packages/fairseq_cli/train.py", line 284, in distributed_main
    main(args, init_distributed=True)
  File "/opt/conda/lib/python3.6/site-packages/fairseq_cli/train.py", line 33, in main
    args.distributed_rank = distributed_utils.distributed_init(args)
  File "/opt/conda/lib/python3.6/site-packages/fairseq/distributed_utils.py", line 90, in distributed_init
    dist.all_reduce(torch.rand(1).cuda())
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 900, in all_reduce
    work = _default_pg.allreduce([tensor], opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1556653099582/work/torch/lib/c10d/ProcessGroupNCCL.cpp:272, unhandled system error

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 235, in <module>
    main()
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 231, in main
    cmd=process.args)
```

Master node prints:

```
| distributed init (rank 0): tcp://pytorch-fairseq-master-0:23456
| distributed init (rank 3): tcp://pytorch-fairseq-master-0:23456
| initialized host pytorch-fairseq-master-0 as rank 3
| distributed init (rank 1): tcp://pytorch-fairseq-master-0:23456
| initialized host pytorch-fairseq-master-0 as rank 1
| distributed init (rank 2): tcp://pytorch-fairseq-master-0:23456
| initialized host pytorch-fairseq-master-0 as rank 2
| initialized host pytorch-fairseq-master-0 as rank 0
pytorch-fairseq-master-0:16:16 [0] NCCL INFO NET/Socket : Using [0]eth0:172.31.1.89<0>
pytorch-fairseq-master-0:16:16 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).

pytorch-fairseq-master-0:16:16 [0] misc/ibvwrap.cu:63 NCCL WARN Failed to open libibverbs.so[.1]
NCCL version 2.4.2+cuda10.0
pytorch-fairseq-master-0:17:17 [1] NCCL INFO NET/Socket : Using [0]eth0:172.31.1.89<0>
pytorch-fairseq-master-0:18:18 [2] NCCL INFO NET/Socket : Using [0]eth0:172.31.1.89<0>
pytorch-fairseq-master-0:19:19 [3] NCCL INFO NET/Socket : Using [0]eth0:172.31.1.89<0>
pytorch-fairseq-master-0:17:17 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
pytorch-fairseq-master-0:19:19 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
pytorch-fairseq-master-0:18:18 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).

pytorch-fairseq-master-0:17:17 [1] misc/ibvwrap.cu:63 NCCL WARN Failed to open libibverbs.so[.1]

pytorch-fairseq-master-0:19:19 [3] misc/ibvwrap.cu:63 NCCL WARN Failed to open libibverbs.so[.1]

pytorch-fairseq-master-0:18:18 [2] misc/ibvwrap.cu:63 NCCL WARN Failed to open libibverbs.so[.1]
pytorch-fairseq-master-0:17:43 [1] NCCL INFO Setting affinity for GPU 1 to 3ff003ff
pytorch-fairseq-master-0:17:43 [1] NCCL INFO comm 0x7fe25c001ac0 rank 1 nranks 8 cudaDev 1 nvmlDev 4
pytorch-fairseq-master-0:18:44 [2] NCCL INFO Setting affinity for GPU 2 to 3ff003ff
pytorch-fairseq-master-0:18:44 [2] NCCL INFO comm 0x7fae14001ac0 rank 2 nranks 8 cudaDev 2 nvmlDev 5
pytorch-fairseq-master-0:19:45 [3] NCCL INFO Setting affinity for GPU 3 to 3ff003ff
pytorch-fairseq-master-0:19:45 [3] NCCL INFO comm 0x7f5b84001ac0 rank 3 nranks 8 cudaDev 3 nvmlDev 6
pytorch-fairseq-master-0:16:42 [0] NCCL INFO Setting affinity for GPU 0 to 3ff003ff
pytorch-fairseq-master-0:16:42 [0] NCCL INFO comm 0x7f789c001ac0 rank 0 nranks 8 cudaDev 0 nvmlDev 3

pytorch-fairseq-master-0:16:41 [0] bootstrap.cu:102 NCCL WARN Bootstrap Root : rank 3 of 8 ranks has already checked in
```

Environment:

  • k8s 1.13.5
  • CUDA 10.0.130
  • cuDNN 7.5
  • Python 3.6
  • PyTorch 1.1.0
  • NCCL 2.4.2
  • fairseq 0.8.0

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 6 (1 by maintainers)

Top GitHub Comments

4 reactions
wenlian commented, Sep 26, 2019

> Hi @wenlian,
>
> Can you please share some information about how you set up the entire infrastructure for performing distributed training? I'm kind of struggling with how to set up the infrastructure (i.e. the servers and workers, and how they communicate with each other). Is it a parameter-server-based architecture or an all-reduce kind of architecture?
>
> If you can share which distributed-training-related flags you are using, it will really help me. Especially, how are you using the --distributed-init-method and --distributed-port flags? Or, if you can share the command that you are executing, please share that as well.
>
> Thanks, Jalaj

It's an all-reduce architecture that communicates through the NCCL library over Gigabit Ethernet (sockets) or InfiniBand. You can get more information about the NCCL variables at this link: NCCL variables. I usually deploy the master and worker pods on k8s with pytorch-operator and keep the training data on distributed storage. The system environment variables are set as follows:

MASTER_SOCKET_IFNAME=eth0  # or the ib device name
NCCL_IB_CUDA_SUPPORT=0
NCCL_P2P_DISABLE=0
NCCL_DEBUG=INFO
NCCL_IB_DISABLE=1  # ib not set
DIST_SIZE=16
DIST_RANK=0  # or 8; the master node is 0 and the worker node is 8 if there are 2 nodes with 8 GPUs each
NCCL_SOCKET_IFNAME=eth0
NCCL_NET_GDR_LEVEL=3
NCCL_NET_GDR_READ=0
NCCL_SHM_DISABLE=0
MASTER_IP=xxxx

The start script is: `/opt/conda/bin/fairseq-train /workspace/data/train_data --save-dir /data/checkpoints --distributed-world-size ${DIST_SIZE} --distributed-rank ${DIST_RANK} --ddp-backend no_c10d --distributed-init-method 'tcp://'${MASTER_IP}':23456' --distributed-port 23456 ....`

Some of the parameters may need to be adjusted according to the environment.
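To make the per-node difference concrete, here is a minimal sketch of how the launch above might look for the 2-nodes-with-8-GPUs case. The hostname, paths, and port are taken from the thread, but they and the omitted model/task arguments are placeholders that would need to match your own cluster, so treat this as an illustration rather than a verified recipe.

```bash
# Sketch only: 2 nodes x 8 GPUs, total world size 16.
# Hostname, paths and the omitted fairseq model/task arguments are placeholders.

MASTER_IP=pytorch-fairseq-master-0   # hostname (or IP) of the master pod
DIST_SIZE=16                         # total number of GPUs across both nodes

# On the master node, which owns ranks 0-7:
DIST_RANK=0
/opt/conda/bin/fairseq-train /workspace/data/train_data \
  --save-dir /data/checkpoints \
  --distributed-world-size ${DIST_SIZE} \
  --distributed-rank ${DIST_RANK} \
  --ddp-backend no_c10d \
  --distributed-init-method "tcp://${MASTER_IP}:23456" \
  --distributed-port 23456
  # ...plus the usual model/task arguments

# On the worker node the command is identical, except for the starting rank:
#   DIST_RANK=8   (that node owns ranks 8-15)
```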

3 reactions
wenlian commented, Sep 24, 2019

> problem is solved…
>
> Hi, I met the same problem. Could you please share how you solved it?

I started the multi-machine job directly with `/opt/conda/bin/fairseq-train /workspace/data/train_data --save-dir /…`. The problem was solved after setting `--distributed-world-size ${DIST_SIZE} --distributed-rank ${DIST_RANK}`:

DIST_SIZE: the total number of GPUs. 2 nodes with 8 GPUs each: 16; 2 nodes with 4 GPUs each: 8.
DIST_RANK: for 2 nodes with 8 GPUs each, the first node is 0 and the second node is 8; for 2 nodes with 4 GPUs each, the first node is 0 and the second node is 4.

Hope it helps.
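Put differently, the world size is just the total GPU count, and each node's starting rank is its node index multiplied by the GPUs per node. Below is a tiny illustrative sketch of that arithmetic; the helper variables (NUM_NODES, GPUS_PER_NODE, NODE_INDEX) are made up for this example and are not fairseq options.

```bash
# Illustrative only: derive the two fairseq flags for one node.
# NUM_NODES, GPUS_PER_NODE and NODE_INDEX are made-up helper variables.
NUM_NODES=2
GPUS_PER_NODE=4      # or 8
NODE_INDEX=1         # 0 for the first node, 1 for the second

DIST_SIZE=$(( NUM_NODES * GPUS_PER_NODE ))    # 8 for 2x4 GPUs, 16 for 2x8 GPUs
DIST_RANK=$(( NODE_INDEX * GPUS_PER_NODE ))   # 0 on the first node, 4 (or 8) on the second

echo "--distributed-world-size ${DIST_SIZE} --distributed-rank ${DIST_RANK}"
```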

Read more comments on GitHub >

Top Results From Across the Web

Multi node PyTorch Distributed Training Guide For People In A ...
The goal of this tutorial is to give a summary of how to write and launch PyTorch distributed data parallel jobs across multiple...
Read more >
Training on multiple GPUs and multi-node training ... - YouTube
In this video we'll cover how multi-GPU and multi-node training works in general...
Read more >
Encounter Error while running distributed training on fairseq
Hi PyTorch Community Members, I am trying to run distributed training on 2 nodes with 8 GPUs each (K80) in total 16 GPUs....
Read more >
PyTorch Distributed Training - Lei Mao's Log Book
PyTorch Distributed Training for Dummies. ... Each of the nodes has 8 GPUs and each...
Read more >
