DDP is slow
Hi, when I use distributed data parallel on a single node, it works well. However, on multiple nodes it does not. I use the speaker recognition VoxCeleb ECAPA-TDNN recipe. Batch size is set to 8. I added
train_data = train_data.filtered_sorted(select_n=30000)
to the recipe, so I trained on 30000 samples to avoid long training times, and I trained the models for 1 epoch.
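For reference, a minimal sketch of where that subsetting could go (this is not the exact recipe code; the manifest path and hparams keys are placeholders, only the `filtered_sorted(select_n=...)` call is the change described above):

```python
# Minimal sketch (not the exact recipe code): subsetting a SpeechBrain
# DynamicItemDataset so only the first 30000 examples are used.
from speechbrain.dataio.dataset import DynamicItemDataset

def dataio_prep(hparams):
    # Hypothetical manifest path and hparams keys; the real recipe builds
    # these during its data preparation step.
    train_data = DynamicItemDataset.from_csv(
        csv_path=hparams["train_annotation"],
        replacements={"data_root": hparams["data_folder"]},
    )
    # Keep only the first 30000 examples to shorten each benchmark epoch.
    train_data = train_data.filtered_sorted(select_n=30000)
    return train_data
```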
Results:
Start server, devices (server:#GPUs) | Elapsed Time (min) | Notes |
---|---|---|
server1, server1:1 | 27:56 | 1 epoch, torch.distributed.launch + nccl |
server1, server1:2 | 14:22 | 1 epoch, torch.distributed.launch + nccl |
server1, server1:3 | 09:57 | 1 epoch, torch.distributed.launch + nccl |
server1, server1:4 | 07:57 | 1 epoch, torch.distributed.launch + nccl |
server1, server1:2, server2:2 | 27:27 | 1 epoch, torchrun |
server1, server1:2, server2:4 | 27:25 | 1 epoch, torchrun |
server1, server1:2, server2:2 | 28:18 | 1 epoch, torch.distributed.launch + gloo |
First, I conducted experiments with a single-node multi-GPU setup, increasing the number of GPUs over the first four experiments.
Second, I tried a multi-node setup. NCCL did not work: when I ran the scripts, they froze during ddp_init_group(). This looks close to these issues: 1 2 3 4. There might be a deadlock, but I could not solve it.
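Before digging into SpeechBrain itself, it can help to check whether a bare torch.distributed rendezvous works across the two servers. A minimal sketch, launched with the same command on both nodes (the NCCL_SOCKET_IFNAME value "eth0" is an assumption; set it to the NIC that actually connects the servers):

```python
# Minimal standalone test (independent of SpeechBrain) of the multi-node
# NCCL rendezvous; run it on both nodes with the same launcher used for
# training. NCCL_DEBUG and NCCL_SOCKET_IFNAME are standard NCCL env vars.
import os
from datetime import timedelta

import torch
import torch.distributed as dist

os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # placeholder interface

# The launcher provides RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT and
# LOCAL_RANK; a finite timeout makes a deadlock more likely to raise an
# error instead of hanging forever.
dist.init_process_group(backend="nccl", init_method="env://",
                        timeout=timedelta(seconds=60))

local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

x = torch.ones(1, device=f"cuda:{local_rank}")
dist.all_reduce(x)  # every rank should end up with the world size
print(f"rank {dist.get_rank()}: all_reduce result = {x.item()}")
dist.destroy_process_group()
```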
Third, since torch.distributed.launch + nccl did not work, I tried torchrun. My start command changed:
from python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr server1 --master_port 55554 train_speaker_embeddings.py hparams/train_ecapa_tdnn_dist.yaml --distributed_launch --distributed_backend='nccl'
to torchrun --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr server1 --master_port 55554 train_speaker_embeddings.py hparams/train_ecapa_tdnn_dist.yaml
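One practical difference between the two launchers: torch.distributed.launch passes a --local_rank command-line argument by default, while torchrun exports the LOCAL_RANK environment variable. SpeechBrain's argument parsing handles this internally; the generic sketch below only illustrates the difference, it is not part of the recipe:

```python
# Generic sketch: resolve the local rank under either launcher.
# torchrun always sets the LOCAL_RANK environment variable, while the
# deprecated torch.distributed.launch passes --local_rank as an argument
# unless it is started with --use_env.
import argparse
import os

def get_local_rank() -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=None)
    args, _ = parser.parse_known_args()
    if args.local_rank is not None:
        return args.local_rank
    return int(os.environ.get("LOCAL_RANK", 0))

print("local rank:", get_local_rank())
```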
Lastly, I tried torch.distributed.launch with the gloo backend (multi-node setup):
python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr server1 --master_port 55554 train_speaker_embeddings.py hparams/train_ecapa_tdnn_dist.yaml --distributed_launch --distributed_backend='gloo'
As a result, DDP on a single node decreases training time, but it does not on multiple nodes.
I also tested my network connection with iperf and measured 625 Mbit/s of bandwidth.
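For a rough sanity check on whether 625 Mbit/s can sustain multi-node DDP, the per-step gradient traffic can be estimated; the parameter count below is an assumed order of magnitude for an ECAPA-TDNN-sized model, not a value taken from the recipe:

```python
# Back-of-the-envelope estimate of gradient synchronization time per step
# over the measured inter-node link. The parameter count is an assumption
# (roughly ECAPA-TDNN scale), not measured.
params = 20e6                # assumed number of trainable parameters
bytes_per_param = 4          # fp32 gradients
grad_bytes = params * bytes_per_param

# A ring all-reduce moves about 2 * (N - 1) / N of the gradient volume per
# node and per step; with N = 2 nodes that is one full copy each way.
n_nodes = 2
traffic_bytes = 2 * (n_nodes - 1) / n_nodes * grad_bytes

link_bits_per_s = 625e6      # iperf-measured bandwidth
seconds_per_step = traffic_bytes * 8 / link_bits_per_s
print(f"~{seconds_per_step:.2f} s of gradient traffic per optimizer step")
# Roughly one second of communication per step, even before any NCCL
# overhead, easily dominates the compute for small batches, which matches
# the observation that multi-node runs are no faster than a single GPU.
```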
Has anyone faced this problem and solved it?
Some package versions in my environment:
- speechbrain 0.5.11
- pytorch 1.10.1
- pytorch-lightning 1.5.10
Top GitHub Comments
I just exported a conda environment yaml file and created a new environment on server2, then ran the same commands with different node ranks. Command:
python3 -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=1 --master_addr server1 --master_port 53534 train_speaker_embeddings.py hparams/train_ecapa_tdnn_dist.yaml --distributed_launch --distributed_backend='nccl'
Thank you for the discussion, but I guess my main issue is the network. I used Docker to check whether the network is the problem.
I conducted experiments with Docker containers. My Dockerfile:
I ran 2 different containers at the same time on the same node, with different GPU devices. I did not limit hardware (memory, network, etc.). I checked the Docker container IP addresses with
docker inspect -f "{{ .NetworkSettings.IPAddress }}" container_name
Then my run commands are as follows. First docker container:
python3 -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr docker_container_address_1 --master_port 53534 train_speaker_embeddings.py hparams/train_ecapa_tdnn_dist.yaml --distributed_launch --distributed_backend='nccl'
Second docker container:
python3 -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=1 --master_addr docker_container_address_1 --master_port 53534 train_speaker_embeddings.py hparams/train_ecapa_tdnn_dist.yaml --distributed_launch --distributed_backend='nccl'
Training results:
Up to this point, all elapsed times were the expected values. However, I then limited the network bandwidth for the 2 containers, following
https://stackoverflow.com/questions/25497523/how-can-i-rate-limit-network-traffic-on-a-docker-container
$ tc qdisc add dev eth0 handle 1: ingress
$ tc qdisc add dev eth0 root tbf rate 10gbit latency 25ms burst 10k
Results:
As a result, the issue is a network bandwidth problem.
Could you please give me the commands you typed on both servers?