Slow progress using accelerate on multi-node

Running multi-GPU on a single host via the accelerate scripts improves performance. Running multi-node, multi-GPU, however, degrades performance to the point of being unusable.

Benchmarks:

Single P4 GPU: 1.8 it/sec
    Iteration: 100%|█████████████████████████████████████████████| 20/20 [00:26<00:00, 1.33s/it]

Dual P4 GPU (same host): 2.21 it/sec
    Iteration: 70%|██████████████████████████████▍ | 14/20 [00:21<00:02, 2.21it/s]

Quad P4 GPU (two hosts): 285.15 sec/it
    Iteration: 2%|██▊ | 12/500 [32:55<38:39:10, 285.15s/it]
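
The progress bars above mix two units: tqdm reports it/s when iterations are fast and s/it when they are slow. To put the numbers on one scale, here is a minimal sketch that converts a tqdm-style rate string to seconds per iteration. The helper name `seconds_per_iteration` is made up for illustration. The comparison assumes each iteration does the same amount of work in both runs, which the issue does not state.

```python
def seconds_per_iteration(rate: str) -> float:
    """Convert a tqdm-style rate such as '2.21it/s' or '285.15s/it'
    to seconds per iteration. Hypothetical helper, not part of tqdm."""
    rate = rate.strip()
    if rate.endswith("s/it"):
        # Already seconds per iteration.
        return float(rate[: -len("s/it")])
    if rate.endswith("it/s"):
        # Iterations per second: invert to get seconds per iteration.
        return 1.0 / float(rate[: -len("it/s")])
    raise ValueError(f"unrecognized rate: {rate!r}")


# Rates copied from the progress bars in the benchmarks.
dual_same_host = seconds_per_iteration("2.21it/s")
quad_two_hosts = seconds_per_iteration("285.15s/it")

slowdown = quad_two_hosts / dual_same_host
print(f"dual GPU, same host: {dual_same_host:.3f} s/it")
print(f"quad GPU, two hosts: {quad_two_hosts:.2f} s/it")
print(f"per-iteration slowdown: about {slowdown:.0f}x")
```

Under that assumption, the two-host run is roughly 630 times slower per iteration than the dual-GPU run on one host.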

Opening this issue to track the resolution publicly.

Top GitHub Comments

deepglugs commented, Sep 10, 2022 (2 reactions)

I was able to get reasonable results after switching my dataloader to webdataset. There was likely a huge bottleneck in my custom dataloader that was causing 200+ s/it even with only two GPUs. I then got 4 GPUs going, but noticed the loss was 0.0 and stayed at 0.0. I’ll experiment with the learning rate and see if that helps (update: it didn’t).
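
One way to check for a dataloader bottleneck like the one described in this comment is to time how long the training loop waits for each batch separately from how long the training step takes. The sketch below is an assumption-laden illustration, not the commenter's code: `profile_loader`, `slow_loader`, and the no-op step are hypothetical names, and the loader is simulated with `time.sleep`.

```python
import time
from typing import Callable, Iterable, Tuple


def profile_loader(
    batches: Iterable,
    step: Callable[[object], None],
    max_batches: int = 20,
) -> Tuple[float, float]:
    """Return (seconds spent waiting for data, seconds spent in step).

    Hypothetical helper: works with any iterable of batches.
    """
    data_time = 0.0
    step_time = 0.0
    iterator = iter(batches)
    for _ in range(max_batches):
        t0 = time.perf_counter()
        try:
            batch = next(iterator)  # time spent here is data loading
        except StopIteration:
            break
        t1 = time.perf_counter()
        step(batch)  # time spent here is the training step
        t2 = time.perf_counter()
        data_time += t1 - t0
        step_time += t2 - t1
    return data_time, step_time


def slow_loader(n: int, delay: float = 0.02):
    """Simulated loader; the sleep stands in for slow decoding or I/O."""
    for i in range(n):
        time.sleep(delay)
        yield i


data_s, step_s = profile_loader(slow_loader(5), step=lambda batch: None)
print(f"data wait: {data_s:.3f}s, step: {step_s:.3f}s")
```

If the data wait dominates, the loader is the likely bottleneck rather than inter-node communication. With a real GPU training step, CUDA work is launched asynchronously, so the step timing is only meaningful if the step synchronizes with the device before returning.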

haukened commented, Sep 9, 2022 (2 reactions)

Wait. What’s a vacation?

On Sep 8, 2022, at 6:20 PM, Phil Wang wrote:

@muellerzr have a great vacay! I need a vacation 🤔