Slow progress using accelerate on multi-node

Running the accelerate scripts on multiple GPUs within a single host improves performance. Running them multi-node, multi-GPU, however, degrades performance to the point of being unusable.

Benchmarks:

  1. Single P4 GPU: 1.8 it/sec Iteration: 100%|█████████████████████████████████████████████| 20/20 [00:26<00:00, 1.33s/it]
  2. Dual P4 GPU (Same host): 2.21 it/sec Iteration: 70%|███████████████████████████████              | 14/20 [00:21<00:02, 2.21it/s]
  3. Quad P4 GPU (Two hosts): 285.15 sec/it Iteration: 2%|██▊ | 12/500 [32:55<38:39:10, 285.15s/it]
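
To put the three figures on one scale, the sketch below converts the reported rates to iterations per second. It is only arithmetic on the numbers quoted in the list; the dictionary labels and the helper name `to_it_per_sec` are made up for illustration and do not come from the issue.

```python
def to_it_per_sec(value: float, unit: str) -> float:
    """Convert a tqdm-style rate to iterations per second."""
    if unit == "it/s":
        return value
    if unit == "s/it":
        return 1.0 / value
    raise ValueError(f"unknown unit: {unit}")


# Rates as reported in the benchmark list (labels are illustrative).
reported = {
    "single_gpu": (1.8, "it/s"),
    "dual_gpu_same_host": (2.21, "it/s"),
    "quad_gpu_two_hosts": (285.15, "s/it"),
}

rates = {name: to_it_per_sec(v, u) for name, (v, u) in reported.items()}
slowdown = rates["dual_gpu_same_host"] / rates["quad_gpu_two_hosts"]

for name, rate in rates.items():
    print(f"{name}: {rate:.4f} it/s")
print(f"two-host run is about {slowdown:.0f}x slower than the dual-GPU run")
```

With these inputs the two-host run comes out roughly 630 times slower than the dual-GPU run on a single host.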

Opening this issue as a way to track resolution for public information.

Issue Analytics

  • State: open
  • Created a year ago
  • Reactions: 1
  • Comments: 13 (8 by maintainers)

Top GitHub Comments

2 reactions
deepglugs commented, Sep 10, 2022

I was able to get reasonable results after switching my dataloader to use webdataset. There was likely a huge bottleneck in my custom dataloader that was causing 200+s/it even with only two GPUs. I was able to get 4 GPUs going, but noticed loss was 0.0 (and stays at 0.0). I’ll mess around with the learning rate and see if that helps (update: it didn’t).
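
One way to check for a dataloader bottleneck like the one described above is to time how long each iteration waits for a batch versus how long the step itself takes. The following is a minimal sketch using only the Python standard library; `profile_loop`, `slow_loader`, and `fake_step` are hypothetical stand-ins and are not part of accelerate or webdataset.

```python
import time
from typing import Callable, Iterable, Tuple


def profile_loop(batches: Iterable, step: Callable[[object], None]) -> Tuple[float, float]:
    """Return (seconds spent waiting for data, seconds spent in step)."""
    data_time = 0.0
    step_time = 0.0
    it = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)  # time blocked waiting on the loader
        except StopIteration:
            break
        t1 = time.perf_counter()
        step(batch)  # time spent in the (stand-in) training step
        t2 = time.perf_counter()
        data_time += t1 - t0
        step_time += t2 - t1
    return data_time, step_time


def slow_loader(n: int, delay: float):
    """Hypothetical loader that takes `delay` seconds to produce each batch."""
    for i in range(n):
        time.sleep(delay)
        yield i


def fake_step(batch) -> None:
    """Hypothetical training step, much cheaper than loading a batch."""
    time.sleep(0.001)


data_time, step_time = profile_loop(slow_loader(5, 0.02), fake_step)
print(f"data wait: {data_time:.3f}s, step: {step_time:.3f}s")
```

If the data wait dominates, the loader rather than the inter-node communication is the first thing to look at. On a GPU, kernels launch asynchronously, so a real step would need a synchronization point (for example `torch.cuda.synchronize()`) before reading the clock.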

2 reactions
haukened commented, Sep 9, 2022

Wait. What’s a vacation?

On Sep 8, 2022, at 6:20 PM, Phil Wang @.***> wrote:

@muellerzr have a great vacay! I need a vacation 🤔
