Multi-GPU is taking more time than single GPU
I tried running the downstream ASR model using both single-GPU and multi-GPU (DDP) settings:
Single-GPU command:
python3 run_downstream.py -m train -n asr_tera -u tera -d asr
Multi-GPU command:
distributed="-m torch.distributed.launch --nproc_per_node 4";
python3 $distributed run_downstream.py -m train -n asr_tera_ddp -u tera -d asr -o config.runner.gradient_accumulate_steps=2
However, the multi-GPU run is taking more time than the single-GPU one. Time taken by the multi-GPU setting: ~6 days (using 4 GPUs). Time taken by the single-GPU setting: ~3 days (using 1 GPU).
Do you have any idea why this is happening? Have you tested the code using DDP?
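(For anyone trying to narrow this down: one rough way to separate data-loading and gradient-synchronization overhead from compute is to time steps on synthetic in-memory batches. The sketch below is not s3prl code; the model, batch shape, and step count are placeholders, and it assumes an env-var style launch such as torchrun or torch.distributed.launch --use_env.)

```python
# Rough timing sketch (not s3prl code): model, batch size, and step count are placeholders.
# Launch e.g. with:  torchrun --nproc_per_node 4 time_steps.py   (file name is hypothetical)
import os
import time

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    dist.init_process_group(backend="nccl")       # launcher supplies MASTER_ADDR/PORT, RANK, WORLD_SIZE
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    model = DDP(torch.nn.Linear(80, 40).cuda(), device_ids=[local_rank])  # toy stand-in for the ASR model
    optim = torch.optim.Adam(model.parameters())

    steps = 50
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(steps):
        x = torch.randn(32, 80, device="cuda")    # synthetic batch: removes disk I/O from the picture
        loss = model(x).pow(2).mean()
        optim.zero_grad()
        loss.backward()                           # DDP all-reduces gradients here
        optim.step()
    torch.cuda.synchronize()

    if dist.get_rank() == 0:
        print(f"avg step time: {(time.time() - start) / steps:.4f} s")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

If the per-step time on 4 GPUs with real data is much higher than with synthetic batches, the dataloader is the likely bottleneck; if it is high even with synthetic batches, gradient all-reduce (e.g. a slow interconnect) is the more likely culprit.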
Hi, here are the detailed configs used in both settings:
Single GPU:
Multiple GPUs (DDP):
Hi,
Sorry for the late reply, and thanks for the detailed information! From my point of view, the following two settings should be equivalent:
vs.
or
vs.
The above two comparisons are equivalent. However, the comparison you shared is not equivalent, from my point of view, since the second one actually uses a larger effective batch size (24) and fewer steps (100000). In practice, we usually use a larger batch size to get more accurate gradients, so that training converges faster and requires fewer steps. However, this does not mean that 2x batch size & 0.5x training steps is mathematically equivalent to 1x batch size & 1x training steps. In your case, it seems the larger batch size does not help, or even yields a worse WER. In my experience, our ASR setting won't benefit from a larger batch size, which kind of aligns with your results.
Hence, the above result does not look very weird to me. Please feel free to point out my mistake if you think I am wrong. Thanks!
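To make the effective-batch-size point concrete, here is a small arithmetic sketch; the per-GPU batch size below is a hypothetical placeholder, not the value from the actual downstream config:

```python
# Sketch of the effective-batch-size arithmetic for DDP + gradient accumulation.
# per_gpu_batch is a placeholder assumption; substitute the batch size from your config.

def effective_batch_size(per_gpu_batch: int, num_gpus: int, grad_accum_steps: int) -> int:
    """Number of examples contributing to each optimizer update."""
    return per_gpu_batch * num_gpus * grad_accum_steps

per_gpu_batch = 3  # placeholder assumption

single = effective_batch_size(per_gpu_batch, num_gpus=1, grad_accum_steps=1)  # e.g. 3
ddp    = effective_batch_size(per_gpu_batch, num_gpus=4, grad_accum_steps=2)  # e.g. 24

print(single, ddp)  # 4 GPUs x grad_accum 2 -> 8x the data per optimizer update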
I am closing this issue for now. Feel free to re-open it!
Sincerely, Leo