DDP is slow
Hi, when I use distributed data parallel on a single node, it works well. However, on multiple nodes it does not. I use the speaker recognition VoxCeleb ECAPA-TDNN recipe. Batch size is set to 8. I added
train_data = train_data.filtered_sorted(select_n=30000)
to the recipe, so I trained on 30000 samples to avoid long training times, and I trained the models for 1 epoch.
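For reference, a minimal sketch of where that subsetting could go (this is not the exact recipe code; the manifest path and hparams keys are placeholders, only the `filtered_sorted(select_n=...)` call is the change described above):

```python
# Minimal sketch (not the exact recipe code): subsetting a SpeechBrain
# DynamicItemDataset so only the first 30000 examples are used.
from speechbrain.dataio.dataset import DynamicItemDataset

def dataio_prep(hparams):
    # Hypothetical manifest path and hparams keys; the real recipe builds
    # these during its data preparation step.
    train_data = DynamicItemDataset.from_csv(
        csv_path=hparams["train_annotation"],
        replacements={"data_root": hparams["data_folder"]},
    )
    # Keep only the first 30000 examples to shorten each benchmark epoch.
    train_data = train_data.filtered_sorted(select_n=30000)
    return train_data
```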
Results:
Start server, devices (server:#GPUs) | Elapsed Time (min) | Notes |
---|---|---|
server1, server1:1 | 27:56 | 1 epoch, torch.distributed.launch + nccl |
server1, server1:2 | 14:22 | 1 epoch, torch.distributed.launch + nccl |
server1, server1:3 | 09:57 | 1 epoch, torch.distributed.launch + nccl |
server1, server1:4 | 07:57 | 1 epoch, torch.distributed.launch + nccl |
server1, server1:2, server2:2 | 27:27 | 1 epoch, torchrun |
server1, server1:2, server2:4 | 27:25 | 1 epoch, torchrun |
server1, server1:2, server2:2 | 28:18 | 1 epoch, torch.distributed.launch + gloo |
First, I conducted experiments with a single-node multi-GPU setup, increasing the number of GPUs over the first four experiments.
Second, I tried a multi-node setup. NCCL did not work: when I ran the scripts, they froze during ddp_init_group(). This looks close to these issues: 1 2 3 4. There might be a deadlock, but I could not solve it.
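Before digging into SpeechBrain itself, it can help to check whether a bare torch.distributed rendezvous works across the two servers. A minimal sketch, launched with the same command on both nodes (the NCCL_SOCKET_IFNAME value "eth0" is an assumption; set it to the NIC that actually connects the servers):

```python
# Minimal standalone test (independent of SpeechBrain) of the multi-node
# NCCL rendezvous; run it on both nodes with the same launcher used for
# training. NCCL_DEBUG and NCCL_SOCKET_IFNAME are standard NCCL env vars.
import os
from datetime import timedelta

import torch
import torch.distributed as dist

os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # placeholder interface

# The launcher provides RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT and
# LOCAL_RANK; a finite timeout makes a deadlock more likely to raise an
# error instead of hanging forever.
dist.init_process_group(backend="nccl", init_method="env://",
                        timeout=timedelta(seconds=60))

local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

x = torch.ones(1, device=f"cuda:{local_rank}")
dist.all_reduce(x)  # every rank should end up with the world size
print(f"rank {dist.get_rank()}: all_reduce result = {x.item()}")
dist.destroy_process_group()
```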
Third, since torch.distributed.launch + nccl did not work, I tried torchrun. My start command changed:
from python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr server1 --master_port 55554 train_speaker_embeddings.py hparams/train_ecapa_tdnn_dist.yaml --distributed_launch --distributed_backend='nccl'
to torchrun --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr server1 --master_port 55554 train_speaker_embeddings.py hparams/train_ecapa_tdnn_dist.yaml
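One practical difference between the two launchers: torch.distributed.launch passes a --local_rank command-line argument by default, while torchrun exports the LOCAL_RANK environment variable. SpeechBrain's argument parsing handles this internally; the generic sketch below only illustrates the difference, it is not part of the recipe:

```python
# Generic sketch: resolve the local rank under either launcher.
# torchrun always sets the LOCAL_RANK environment variable, while the
# deprecated torch.distributed.launch passes --local_rank as an argument
# unless it is started with --use_env.
import argparse
import os

def get_local_rank() -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=None)
    args, _ = parser.parse_known_args()
    if args.local_rank is not None:
        return args.local_rank
    return int(os.environ.get("LOCAL_RANK", 0))

print("local rank:", get_local_rank())
```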
Lastly, I tried torch.distributed.launch with the gloo backend (multi-node setup):
python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr server1 --master_port 55554 train_speaker_embeddings.py hparams/train_ecapa_tdnn_dist.yaml --distributed_launch --distributed_backend='gloo'
As a result, DDP on a single node decreases training time, but it does not on multiple nodes.
I also tested my network connection with iperf and measured 625 Mbit/s of bandwidth.
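For a rough sanity check on whether 625 Mbit/s can sustain multi-node DDP, the per-step gradient traffic can be estimated; the parameter count below is an assumed order of magnitude for an ECAPA-TDNN-sized model, not a value taken from the recipe:

```python
# Back-of-the-envelope estimate of gradient synchronization time per step
# over the measured inter-node link. The parameter count is an assumption
# (roughly ECAPA-TDNN scale), not measured.
params = 20e6                # assumed number of trainable parameters
bytes_per_param = 4          # fp32 gradients
grad_bytes = params * bytes_per_param

# A ring all-reduce moves about 2 * (N - 1) / N of the gradient volume per
# node and per step; with N = 2 nodes that is one full copy each way.
n_nodes = 2
traffic_bytes = 2 * (n_nodes - 1) / n_nodes * grad_bytes

link_bits_per_s = 625e6      # iperf-measured bandwidth
seconds_per_step = traffic_bytes * 8 / link_bits_per_s
print(f"~{seconds_per_step:.2f} s of gradient traffic per optimizer step")
# Roughly one second of communication per step, even before any NCCL
# overhead, easily dominates the compute for small batches, which matches
# the observation that multi-node runs are no faster than a single GPU.
```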
Has anyone faced this problem and solved it?
Some package versions in my environment:
- speechbrain 0.5.11
- pytorch 1.10.1
- pytorch-lightning 1.5.10
Top GitHub Comments
I just exported a conda environment yaml file and created a new environment on server2, then ran the same commands with different node ranks. Command:
python3 -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=1 --master_addr server1 --master_port 53534 train_speaker_embeddings.py hparams/train_ecapa_tdnn_dist.yaml --distributed_launch --distributed_backend='nccl'
Thank you for the discussion, but I guess my main issue is the network. I used Docker to check whether the network is the problem.
I conducted experiments with Docker containers. My Dockerfile:
I ran 2 different containers at the same time on the same node, with different GPU devices. I did not limit hardware (memory, network, etc.). I checked the Docker container IP addresses with
docker inspect -f "{{ .NetworkSettings.IPAddress }}" container_name
Then my run commands are as follows. First docker container:
python3 -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr docker_container_address_1 --master_port 53534 train_speaker_embeddings.py hparams/train_ecapa_tdnn_dist.yaml --distributed_launch --distributed_backend='nccl'
Second docker container:
python3 -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=1 --master_addr docker_container_address_1 --master_port 53534 train_speaker_embeddings.py hparams/train_ecapa_tdnn_dist.yaml --distributed_launch --distributed_backend='nccl'
Training results:
Up to this point, all elapsed times were the expected values. However, I then limited the network bandwidth for the 2 containers, following
https://stackoverflow.com/questions/25497523/how-can-i-rate-limit-network-traffic-on-a-docker-container
$ tc qdisc add dev eth0 handle 1: ingress
$ tc qdisc add dev eth0 root tbf rate 10gbit latency 25ms burst 10k
Results:
As a result, the issue is a network bandwidth problem.
Could you please give me the commands you typed on both servers?