Slow progress using accelerate on multi-node
See original GitHub issueWhen using multi-GPU via the accelerate
scripts, performance is improved, however when doing multi-node-multi-GPU performance degrades below usability.
Benchmarks:
- Single P4 GPU: 1.8 it/sec
Iteration: 100%|βββββββββββββββββββββββββββββββββββββββββββββ| 20/20 [00:26<00:00, 1.33s/it]
- Dual P4 GPU (Same host): 2.21 it/sec
Iteration: 70%|βββββββββββββββββββββββββββββββ | 14/20 [00:21<00:02, 2.21it/s]
- Quad P4 GPU (Two hosts): 285.15 sec/it
Iteration: 2%|βββ | 12/500 [32:55<38:39:10, 285.15s/it]
Opening this issue as a way to track resolution for public information.
Issue Analytics
- State:
- Created a year ago
- Reactions:1
- Comments:13 (8 by maintainers)
Top Results From Across the Web
Launching Multi-Node Training from a Jupyter Environment
This tutorial teaches you how to fine tune a computer vision model with Accelerate from a Jupyter Notebook on a distributed system. You...
Read more >Hadoop multinode cluster too slow. How do I increase speed ...
I have an estimation of MR job based on input data size - 300GB of input data takes around 24 hours to process....
Read more >Training on multiple GPUs and multi-node training ... - YouTube
In this video we'll cover how multi-GPU and multi-node training works in general.We'll also show how to do this using PyTorchΒ ...
Read more >Multinode Multi-GPU: Using NVIDIA cuFFTMp FFTs at Scale
cuFFTMp is a multi-node, multi-process extension to cuFFT that enables scientists and engineers to solve challenging problems on exascaleΒ ...
Read more >Multi-GPU, Multi-Node Algorithms for Acceleration of ... - NCBI
Sensor resolution is limited by the number of electrodes and their size. In 3D ECT usually three or four rings with 8, 12...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I was able to get reasonable results after switching my dataloader to use webdataset. There was likely a huge bottleneck in my custom dataloader that was causing 200+s/it even with only two GPUs. I was able to get 4 GPUs going, but noticed loss was 0.0 (and stays at 0.0). Iβll mess around with the learning rate and see if that helps (update: it didnβt).
Wait. Whatβs a vacation?
Sent from my iPhone