
Hi, when I use distributed data parallel (DDP) on a single node, it works well. However, when I use multiple nodes, it doesn't work. I use the speaker recognition VoxCeleb ECAPA-TDNN recipe. The batch size is set to 8. I added train_data = train_data.filtered_sorted(select_n=30000) here, so I trained with 30000 samples to avoid long training times. I trained the models for 1 epoch. Results:

Start Server | Devices (node:GPUs)  | Elapsed Time (min) | Notes
server1      | server1:1            | 27:56              | 1 epoch, torch.distributed.launch + nccl
server1      | server1:2            | 14:22              | 1 epoch, torch.distributed.launch + nccl
server1      | server1:3            | 09:57              | 1 epoch, torch.distributed.launch + nccl
server1      | server1:4            | 07:57              | 1 epoch, torch.distributed.launch + nccl
server1      | server1:2, server2:2 | 27:27              | 1 epoch, torchrun
server1      | server1:2, server2:4 | 27:25              | 1 epoch, torchrun
server1      | server1:2, server2:2 | 28:18              | 1 epoch, torch.distributed.launch + gloo
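
For reference, the subset selection mentioned above amounts to a single call on the recipe's training dataset. A minimal sketch, assuming train_data is a SpeechBrain DynamicItemDataset as created in the recipe's dataio preparation:

# Keep only 30000 training samples to shorten the experiments.
# train_data is assumed to be a speechbrain.dataio.dataset.DynamicItemDataset.
train_data = train_data.filtered_sorted(select_n=30000)
print(len(train_data))  # 30000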

First, I conducted experiments with a single-node, multi-GPU setup, increasing the number of GPUs across the first 4 experiments.

Second, I tried a multi-node setup. NCCL didn't work: when I run the scripts, they freeze during ddp_init_group(). This looks similar to these issues: 1 2 3 4. There might be a deadlock, but I couldn't solve it.
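
A diagnostic sketch that might help narrow down where the NCCL init hangs (an assumption on my side, not part of the recipe; eth0 is a hypothetical interface name and should be replaced by the NIC that actually connects server1 and server2):

# Run under the launcher as usual, so RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT
# are already set in the environment of each worker process.
import os
import torch.distributed as dist

os.environ["NCCL_DEBUG"] = "INFO"          # print NCCL init and transport logs
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"  # hypothetical: pin NCCL to the right NIC
# os.environ["NCCL_IB_DISABLE"] = "1"      # optionally rule out InfiniBand transports

dist.init_process_group(backend="nccl", init_method="env://")
print(f"rank {dist.get_rank()} of {dist.get_world_size()} initialized")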

Third, since torch.distributed.launch + nccl didn't work, I tried torchrun. My start command changed from

python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr server1 --master_port 55554 train_speaker_embeddings.py hparams/train_ecapa_tdnn_dist.yaml --distributed_launch --distributed_backend='nccl'

to

torchrun --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr server1 --master_port 55554 train_speaker_embeddings.py hparams/train_ecapa_tdnn_dist.yaml
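
One practical difference between the two launchers is how rank information reaches each worker: torchrun exports it as environment variables, while torch.distributed.launch passes --local_rank as a command-line argument by default. A minimal sketch of what a worker sees under torchrun (generic PyTorch, not specific to the SpeechBrain recipe):

# Under torchrun, each worker process reads its rank info from the environment;
# init_process_group() with the default env:// method also reads MASTER_ADDR and
# MASTER_PORT from there.
import os

local_rank = int(os.environ["LOCAL_RANK"])  # GPU index on this node
rank = int(os.environ["RANK"])              # global rank across all nodes
world_size = int(os.environ["WORLD_SIZE"])  # total number of processes
print(f"rank {rank}/{world_size}, local rank {local_rank}")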

Lastly, I tried torch.distributed.launch with the gloo backend (multi-node setup):

python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr server1 --master_port 55554 train_speaker_embeddings.py hparams/train_ecapa_tdnn_dist.yaml --distributed_launch --distributed_backend='gloo'

As a result, DDP decreases training time on a single node, but not on multiple nodes.

I also tested my network connection with iperf and measured about 625 Mbits/sec of bandwidth.
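
For a rough sense of why ~625 Mbit/s could dominate the step time in multi-node DDP, a back-of-the-envelope sketch (the ~20M float32 parameters are an assumed model size, not a measured value):

# Rough estimate of gradient all-reduce time per training step over the measured link.
num_params = 20e6                  # assumed ECAPA-TDNN-sized model
grad_mb = num_params * 4 / 1e6     # float32 gradients: ~80 MB per step
link_mb_per_s = 625 / 8            # 625 Mbit/s ~= 78 MB/s
# A ring all-reduce moves roughly 2x the gradient volume over the slowest link.
seconds_per_step = 2 * grad_mb / link_mb_per_s
print(f"~{seconds_per_step:.1f} s of communication per step")  # ~2 s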

Has anyone faced this problem and managed to solve it?

Some package versions in my environment:

  • speechbrain 0.5.11
  • pytorch 1.10.1
  • pytorch-lightning 1.5.10

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 8

Top GitHub Comments

1 reaction
EmreOzkose commented on Jun 6, 2022

I just exported a conda environment YAML file, created a new environment on server2, and ran the same command with a different node rank. Command:

python3 -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=1 --master_addr server1 --master_port 53534 train_speaker_embeddings.py hparams/train_ecapa_tdnn_dist.yaml --distributed_launch --distributed_backend='nccl'

Thank you for the discussion, but I guess my main issue is the network. I used Docker to check whether the network is the problem.

I conducted experiments with Docker containers. My Dockerfile:

FROM ubuntu:20.04 as base-python

ENV DEBIAN_FRONTEND noninteractive

RUN apt-get update && \
    apt-get install -y python3-pip python3-dev gcc && \
    apt-get install -y libsndfile1 && \
    apt-get install -y ffmpeg && \
    apt-get install -y git-all && \
    apt-get install -y vim && \
    apt-get install -y wget && \
    apt-get install -y libopenmpi-dev && \
    apt-get install -y net-tools && \
    apt-get install -y sox && \
    apt-get install -y iperf && \
    apt-get install -y iproute2 && \
    apt-get clean

COPY requirements_sb.txt .

RUN pip install --no-cache-dir --upgrade -r requirements_sb.txt

RUN git clone https://github.com/speechbrain/speechbrain.git
WORKDIR /speechbrain/recipes/VoxCeleb/SpeakerRec

RUN mkdir /speechbrain/recipes/VoxCeleb/SpeakerRec/results
RUN mkdir -p /path/to/voxceleb2/dev_and_v1test

COPY train_ecapa_tdnn_dist.yaml /speechbrain/recipes/VoxCeleb/SpeakerRec/hparams
COPY train_speaker_embeddings.py /speechbrain/recipes/VoxCeleb/SpeakerRec
COPY results /speechbrain/recipes/VoxCeleb/SpeakerRec/results

EXPOSE 5000

CMD ["/bin/bash"]

I ran 2 different containers at the same time on the same node, with different GPU devices assigned to each. I didn't limit any hardware (memory, network, etc.). I checked the Docker container IP addresses with docker inspect -f "{{ .NetworkSettings.IPAddress }}" container_name. Then my run commands were:

First Docker container:

python3 -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr docker_container_address_1 --master_port 53534 train_speaker_embeddings.py hparams/train_ecapa_tdnn_dist.yaml --distributed_launch --distributed_backend='nccl'

Second Docker container:

python3 -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=1 --master_addr docker_container_address_1 --master_port 53534 train_speaker_embeddings.py hparams/train_ecapa_tdnn_dist.yaml --distributed_launch --distributed_backend='nccl'

Training runs:

# | Start Server | Devices              | Elapsed Time (min) | Notes
1 | server1      | server1:1            | 32:29              | 1 epoch, 30000 training samples, nccl
2 | server1      | server1:1, server2:1 | 16:23              |
3 | server1      | server1:2, server2:1 | 11:15              |
4 | server1      | server1:2, server2:2 | 08:12              |

Up to this point, all elapsed times are what I expected. However, I then limited the network bandwidth of the 2 containers:

https://stackoverflow.com/questions/25497523/how-can-i-rate-limit-network-traffic-on-a-docker-container

$ tc qdisc add dev eth0 handle 1: ingress
$ tc qdisc add dev eth0 root tbf rate 10gbit latency 25ms burst 10k

Results:

# | Start Server | Devices              | Elapsed Time (min) | Notes
1 | server1      | server1:1            | 27:58              | limiting bandwidth = 1gbit
2 | server1      | server1:1, server2:1 | 17:05              | limiting bandwidth = 10gbit

As a result, the issue is a network bandwidth problem.

0 reactions
TParcollet commented on Jun 3, 2022

Could you please give me the commands you type on both servers?
