Distributed training error on 2 nodes each with 8 or 4 GPUs
When I run distributed training on 2 nodes, each with 8 or 4 GPUs, the worker node terminates with the following errors:
```
pytorch-fairseq-worker-0:16:63 [0] NCCL INFO Call to connect returned Connection refused, retrying
pytorch-fairseq-worker-0:18:64 [2] NCCL INFO Call to connect returned Connection refused, retrying
pytorch-fairseq-worker-0:16:63 [0] NCCL INFO Call to connect returned Connection refused, retrying
pytorch-fairseq-worker-0:18:64 [2] NCCL INFO Call to connect returned Connection refused, retrying
pytorch-fairseq-worker-0:16:63 [0] NCCL INFO Call to connect returned Connection refused, retrying
pytorch-fairseq-worker-0:18:64 [2] NCCL INFO Call to connect returned Connection refused, retrying
pytorch-fairseq-worker-0:16:63 [0] NCCL INFO Call to connect returned Connection refused, retrying
pytorch-fairseq-worker-0:18:64 [2] NCCL INFO Call to connect returned Connection refused, retrying
pytorch-fairseq-worker-0:16:63 [0] NCCL INFO Call to connect returned Connection refused, retrying
pytorch-fairseq-worker-0:16:63 [0] NCCL INFO Call to connect returned Connection refused, retrying
pytorch-fairseq-worker-0:18:64 [2] NCCL INFO Call to connect returned Connection refused, retrying
pytorch-fairseq-worker-0:18:64 [2] NCCL INFO Call to connect returned Connection refused, retrying
pytorch-fairseq-worker-0:16:63 [0] NCCL INFO Call to connect returned Connection refused, retrying

pytorch-fairseq-worker-0:18:64 [2] include/socket.h:382 NCCL WARN Connect to 172.31.1.37<58944> failed : Connection refused
pytorch-fairseq-worker-0:18:64 [2] NCCL INFO transport/net_socket.cu:142 -> 2
pytorch-fairseq-worker-0:16:63 [0] NCCL INFO Call to connect returned Connection refused, retrying
pytorch-fairseq-worker-0:18:64 [2] NCCL INFO bootstrap.cu:19 -> 2
pytorch-fairseq-worker-0:18:64 [2] NCCL INFO bootstrap.cu:218 -> 2
pytorch-fairseq-worker-0:18:64 [2] NCCL INFO init.cu:683 -> 2
pytorch-fairseq-worker-0:18:64 [2] NCCL INFO init.cu:915 -> 2
pytorch-fairseq-worker-0:18:64 [2] NCCL INFO misc/group.cu:69 -> 2 [Async thread]

pytorch-fairseq-worker-0:16:63 [0] include/socket.h:382 NCCL WARN Connect to 172.31.1.37<58944> failed : Connection refused
pytorch-fairseq-worker-0:16:63 [0] NCCL INFO transport/net_socket.cu:142 -> 2
pytorch-fairseq-worker-0:16:63 [0] NCCL INFO bootstrap.cu:19 -> 2
pytorch-fairseq-worker-0:16:63 [0] NCCL INFO bootstrap.cu:218 -> 2
pytorch-fairseq-worker-0:16:63 [0] NCCL INFO init.cu:683 -> 2
pytorch-fairseq-worker-0:16:63 [0] NCCL INFO init.cu:915 -> 2
pytorch-fairseq-worker-0:16:63 [0] NCCL INFO misc/group.cu:69 -> 2 [Async thread]

Traceback (most recent call last):
  File "/opt/conda/bin/fairseq-train", line 10, in <module>
    sys.exit(cli_main())
  File "/opt/conda/lib/python3.6/site-packages/fairseq_cli/train.py", line 302, in cli_main
    nprocs=torch.cuda.device_count(),
  File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 167, in spawn
    while not spawn_context.join():
  File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 114, in join
    raise Exception(msg)
Exception:

-- Process 4 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/opt/conda/lib/python3.6/site-packages/fairseq_cli/train.py", line 284, in distributed_main
    main(args, init_distributed=True)
  File "/opt/conda/lib/python3.6/site-packages/fairseq_cli/train.py", line 33, in main
    args.distributed_rank = distributed_utils.distributed_init(args)
  File "/opt/conda/lib/python3.6/site-packages/fairseq/distributed_utils.py", line 90, in distributed_init
    dist.all_reduce(torch.rand(1).cuda())
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 900, in all_reduce
    work = _default_pg.allreduce([tensor], opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1556653099582/work/torch/lib/c10d/ProcessGroupNCCL.cpp:272, unhandled system error

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 235, in <module>
    main()
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 231, in main
    cmd=process.args)
```
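The repeated "Connection refused" retries mean the worker process never manages to open a TCP connection to the NCCL bootstrap address printed in the WARN line. A quick sanity check from inside the worker pod, assuming a plain TCP tool such as `nc` is available in the image (the address and port below are the ones from this run's log and will differ between runs):

```bash
# Hypothetical connectivity check from the worker pod.
nc -vz pytorch-fairseq-master-0 23456   # fairseq / torch.distributed rendezvous port
nc -vz 172.31.1.37 58944                # NCCL bootstrap address from the WARN line above
# If either connection is refused, check the Service / NetworkPolicy exposing the
# master pod, and make sure NCCL_SOCKET_IFNAME points at an interface the other
# node can actually reach.
```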
The master node prints:
```
| distributed init (rank 0): tcp://pytorch-fairseq-master-0:23456
| distributed init (rank 3): tcp://pytorch-fairseq-master-0:23456
| initialized host pytorch-fairseq-master-0 as rank 3
| distributed init (rank 1): tcp://pytorch-fairseq-master-0:23456
| initialized host pytorch-fairseq-master-0 as rank 1
| distributed init (rank 2): tcp://pytorch-fairseq-master-0:23456
| initialized host pytorch-fairseq-master-0 as rank 2
| initialized host pytorch-fairseq-master-0 as rank 0
pytorch-fairseq-master-0:16:16 [0] NCCL INFO NET/Socket : Using [0]eth0:172.31.1.89<0>
pytorch-fairseq-master-0:16:16 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).

pytorch-fairseq-master-0:16:16 [0] misc/ibvwrap.cu:63 NCCL WARN Failed to open libibverbs.so[.1]
NCCL version 2.4.2+cuda10.0
pytorch-fairseq-master-0:17:17 [1] NCCL INFO NET/Socket : Using [0]eth0:172.31.1.89<0>
pytorch-fairseq-master-0:18:18 [2] NCCL INFO NET/Socket : Using [0]eth0:172.31.1.89<0>
pytorch-fairseq-master-0:19:19 [3] NCCL INFO NET/Socket : Using [0]eth0:172.31.1.89<0>
pytorch-fairseq-master-0:17:17 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
pytorch-fairseq-master-0:19:19 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
pytorch-fairseq-master-0:18:18 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).

pytorch-fairseq-master-0:17:17 [1] misc/ibvwrap.cu:63 NCCL WARN Failed to open libibverbs.so[.1]
pytorch-fairseq-master-0:19:19 [3] misc/ibvwrap.cu:63 NCCL WARN Failed to open libibverbs.so[.1]
pytorch-fairseq-master-0:18:18 [2] misc/ibvwrap.cu:63 NCCL WARN Failed to open libibverbs.so[.1]
pytorch-fairseq-master-0:17:43 [1] NCCL INFO Setting affinity for GPU 1 to 3ff003ff
pytorch-fairseq-master-0:17:43 [1] NCCL INFO comm 0x7fe25c001ac0 rank 1 nranks 8 cudaDev 1 nvmlDev 4
pytorch-fairseq-master-0:18:44 [2] NCCL INFO Setting affinity for GPU 2 to 3ff003ff
pytorch-fairseq-master-0:18:44 [2] NCCL INFO comm 0x7fae14001ac0 rank 2 nranks 8 cudaDev 2 nvmlDev 5
pytorch-fairseq-master-0:19:45 [3] NCCL INFO Setting affinity for GPU 3 to 3ff003ff
pytorch-fairseq-master-0:19:45 [3] NCCL INFO comm 0x7f5b84001ac0 rank 3 nranks 8 cudaDev 3 nvmlDev 6
pytorch-fairseq-master-0:16:42 [0] NCCL INFO Setting affinity for GPU 0 to 3ff003ff
pytorch-fairseq-master-0:16:42 [0] NCCL INFO comm 0x7f789c001ac0 rank 0 nranks 8 cudaDev 0 nvmlDev 3

pytorch-fairseq-master-0:16:41 [0] bootstrap.cu:102 NCCL WARN Bootstrap Root : rank 3 of 8 ranks has already checked in
```
env:
- k8s 1.13.5
- cuda 10.0.130
- cudnn 7.5
- python 3.6
- pytorch 1.1.0
- nccl 2.4.2
- fairseq 0.8.0
Top GitHub Comments
Fairseq's all-reduce architecture communicates through the NCCL library over Gigabit Ethernet (sockets) or InfiniBand. You can find more information at this link: NCCL variables. I usually deploy the master and worker pods on k8s with pytorch-operator and keep the training data on distributed storage. The environment variables look like this:
```
MASTER_SOCKET_IFNAME=eth0   # or the IB device name
NCCL_IB_CUDA_SUPPORT=0
NCCL_P2P_DISABLE=0
NCCL_DEBUG=INFO
NCCL_IB_DISABLE=1           # IB not set up
DIST_SIZE=16
DIST_RANK=0                 # or 8: master node is 0, worker node is 8 if 2 nodes each have 8 GPUs
NCCL_SOCKET_IFNAME=eth0
NCCL_NET_GDR_LEVEL=3
NCCL_NET_GDR_READ=0
NCCL_SHM_DISABLE=0
MASTER_IP=xxxx
```
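The interface name given to `NCCL_SOCKET_IFNAME` / `MASTER_SOCKET_IFNAME` has to be one that actually carries the pod-to-pod traffic (in the logs above that is `eth0` at 172.31.1.89). A quick way to check inside each pod, assuming the `ip` tool from iproute2 is present in the container image:

```bash
# List interfaces and their IPv4 addresses inside the pod; pick the one whose
# address is reachable from the other node (here eth0 -> 172.31.x.x).
ip -o -4 addr show
# Then point NCCL at it before starting fairseq-train:
export NCCL_SOCKET_IFNAME=eth0
```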
The start script looks like:
```
/opt/conda/bin/fairseq-train /workspace/data/train_data --save-dir /data/checkpoints \
    --distributed-world-size ${DIST_SIZE} --distributed-rank ${DIST_RANK} \
    --ddp-backend no_c10d \
    --distributed-init-method 'tcp://'${MASTER_IP}':23456' --distributed-port 23456 \
    ....
```
Some of the parameters may need to be adjusted to your environment.

I started the multi-machine job directly with `/opt/conda/bin/fairseq-train /workspace/data/train_data --save-dir /…`. The problem was solved after setting `--distributed-world-size ${DIST_SIZE} --distributed-rank ${DIST_RANK}`:
- DIST_SIZE: total number of GPUs. 2 nodes each with 8 GPUs: 16; 2 nodes each with 4 GPUs: 8.
- DIST_RANK: 2 nodes each with 8 GPUs: first node is 0, second node is 8; 2 nodes each with 4 GPUs: first node is 0, second node is 4.

Hope it helps.
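For concreteness, here is a sketch of the two launch commands for the 2-nodes-with-8-GPUs case, following the rule above (paths, data directory, and master address are the ones used in this thread and may differ in your setup). Without the explicit world size and rank, both nodes presumably register the same ranks, which would match the master-side warning "rank 3 of 8 ranks has already checked in".

```bash
# First (master) node: local GPUs become ranks 0-7
/opt/conda/bin/fairseq-train /workspace/data/train_data --save-dir /data/checkpoints \
    --distributed-world-size 16 --distributed-rank 0 \
    --ddp-backend no_c10d \
    --distributed-init-method 'tcp://'${MASTER_IP}':23456' --distributed-port 23456

# Second (worker) node: local GPUs become ranks 8-15
/opt/conda/bin/fairseq-train /workspace/data/train_data --save-dir /data/checkpoints \
    --distributed-world-size 16 --distributed-rank 8 \
    --ddp-backend no_c10d \
    --distributed-init-method 'tcp://'${MASTER_IP}':23456' --distributed-port 23456
```

Each node only passes the rank of its first local GPU; fairseq spawns one process per local GPU and fills in the remaining ranks.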