Multi-GPU is taking more time than single GPU
I tried running the downstream ASR model using both single-GPU and multi-GPU (DDP) settings:
Single-GPU command:
python3 run_downstream.py -m train -n asr_tera -u tera -d asr
Multi-GPU command:
distributed="-m torch.distributed.launch --nproc_per_node 4";
python3 $distributed run_downstream.py -m train -n asr_tera_ddp -u tera -d asr -o config.runner.gradient_accumulate_steps=2
However, the multi-GPU run is taking more time than the single-GPU one. Time taken by the multi-GPU setting: ~6 days (using 4 GPUs). Time taken by the single-GPU setting: ~3 days (using 1 GPU).
Do you have any idea why this is happening? Have you tested the code using DDP?
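(For anyone trying to narrow this down: one rough way to separate data-loading and gradient-synchronization overhead from compute is to time steps on synthetic in-memory batches. The sketch below is not s3prl code; the model, batch shape, and step count are placeholders, and it assumes an env-var style launch such as torchrun or torch.distributed.launch --use_env.)

```python
# Rough timing sketch (not s3prl code): model, batch size, and step count are placeholders.
# Launch e.g. with:  torchrun --nproc_per_node 4 time_steps.py   (file name is hypothetical)
import os
import time

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    dist.init_process_group(backend="nccl")       # launcher supplies MASTER_ADDR/PORT, RANK, WORLD_SIZE
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    model = DDP(torch.nn.Linear(80, 40).cuda(), device_ids=[local_rank])  # toy stand-in for the ASR model
    optim = torch.optim.Adam(model.parameters())

    steps = 50
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(steps):
        x = torch.randn(32, 80, device="cuda")    # synthetic batch: removes disk I/O from the picture
        loss = model(x).pow(2).mean()
        optim.zero_grad()
        loss.backward()                           # DDP all-reduces gradients here
        optim.step()
    torch.cuda.synchronize()

    if dist.get_rank() == 0:
        print(f"avg step time: {(time.time() - start) / steps:.4f} s")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

If the per-step time on 4 GPUs with real data is much higher than with synthetic batches, the dataloader is the likely bottleneck; if it is high even with synthetic batches, gradient all-reduce (e.g. a slow interconnect) is the more likely culprit.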
Hi, here are the detailed configs used in both settings:
Single GPU:
Multiple GPUs (DDP):
Hi,
Sorry for the late reply, and thanks for the detailed information! From my point of view, the following two settings should be equivalent:
vs.
or
vs.
The above two comparisons are equivalent. However, the comparison you shared is not equivalent, from my point of view, since the second one actually uses a larger effective batch size (24) and fewer steps (100000). In practice, we usually use a larger batch size to get more accurate gradients, so that training converges faster and requires fewer steps. However, this does not mean that 2x batch size & 0.5x training steps is mathematically equivalent to 1x batch size & 1x training steps. In your case, it seems the larger batch size does not help, or even yields a worse WER. In my experience, our ASR setting won't benefit from a larger batch size, which kind of aligns with your results.
Hence, the above result does not look very weird to me. Please feel free to point out my mistake if you think I am wrong. Thanks!
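To make the effective-batch-size point concrete, here is a small arithmetic sketch; the per-GPU batch size below is a hypothetical placeholder, not the value from the actual downstream config:

```python
# Sketch of the effective-batch-size arithmetic for DDP + gradient accumulation.
# per_gpu_batch is a placeholder assumption; substitute the batch size from your config.

def effective_batch_size(per_gpu_batch: int, num_gpus: int, grad_accum_steps: int) -> int:
    """Number of examples contributing to each optimizer update."""
    return per_gpu_batch * num_gpus * grad_accum_steps

per_gpu_batch = 3  # placeholder assumption

single = effective_batch_size(per_gpu_batch, num_gpus=1, grad_accum_steps=1)  # e.g. 3
ddp    = effective_batch_size(per_gpu_batch, num_gpus=4, grad_accum_steps=2)  # e.g. 24

print(single, ddp)  # 4 GPUs x grad_accum 2 -> 8x the data per optimizer update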
I am closing this issue for now. Feel free to re-open it!
Sincerely, Leo