Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Does (horovod + learning rate decay) work properly?

See original GitHub issue

Hi, I’m using Horovod on 3 GPUs (in a single machine) with learning rate decay (ReduceLROnPlateau). But I found something strange while looking at the printed log (see the screenshots in the original issue): the learning rate for each process (GPU) does not appear to be synchronized.

Actually, this is my first time using Horovod and I don’t know what’s going on inside Horovod. I would appreciate it if you could let me know if I am mistaken!
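For context, a minimal sketch (not the reporter’s actual script) of the pattern that produces this behaviour with Horovod’s PyTorch API: each rank steps its own ReduceLROnPlateau on its rank-local loss.

```python
import torch
import horovod.torch as hvd

hvd.init()

model = torch.nn.Linear(10, 1)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=2)

for epoch in range(10):
    local_loss = torch.rand(1)  # stand-in for this rank's own validation loss
    # Each process sees a different local loss, so the plateau condition can
    # fire on one GPU but not the others and the learning rates drift apart.
    scheduler.step(local_loss)
    print(f"rank {hvd.rank()}: lr = {optimizer.param_groups[0]['lr']}")
```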

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

2 reactions
janEbert commented, May 31, 2021

Hey, thanks for the report, that’s a great catch! Passing avg_loss instead of loss into the scheduler should already fix this! Sorry for the issue. 😃 That way, all processes use the same loss, so the scheduler stays the same across all workers.
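A minimal sketch of that fix, assuming a PyTorch loop like the one above: allreduce the loss (Horovod averages by default) before stepping the scheduler, so every rank passes the identical value to ReduceLROnPlateau.

```python
import torch
import horovod.torch as hvd

hvd.init()

optimizer = torch.optim.SGD(torch.nn.Linear(10, 1).parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=2)

for epoch in range(10):
    local_loss = torch.rand(1)           # this rank's own validation loss
    # Average the loss over all workers; every rank now drives the
    # scheduler with the same metric, so the learning rates stay in sync.
    avg_loss = hvd.allreduce(local_loss, name="avg_val_loss")
    scheduler.step(avg_loss)
```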

1 reaction
janEbert commented, May 31, 2021

Yeah, exactly! Sorry I wasn’t clearer in my description. Thank you!

Read more comments on GitHub >

Top Results From Across the Web

AdaSum with Horovod
Scaling DNN training to many GPUs always comes at a convergence degradation. This is because with larger batch sizes, gradients are averaged and...
Read more >
Why should we scale the learning rate? · Issue #384 - GitHub
The idea is to scale the learning rate linearly with the batch size to preserve the number of epochs needed for the model...
Read more >
Distributed Deep Learning with Horovod | NVIDIA
How does Deep Learning training work? ... import horovod.tensorflow as hvd ... Google published a paper “Don't Decay the Learning Rate, Increase the....
Read more >
Why is your Horovod slower than the usual?
This article discusses what can be done to train faster with Horovod and some common bottlenecks that could cause a slow down on...
Read more >
Scaling Deep Learning Training - Cray User Group
Training with large learning rates is not stable in the initial stages of ... Linear scaling of learning-rate (N * η) followed by...
Read more >
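Several of the results above touch on the linear learning-rate scaling rule. A small sketch of what that convention looks like in a Horovod script (not taken from any of the linked pages):

```python
import horovod.torch as hvd

hvd.init()

base_lr = 0.01
# Scale the base learning rate by the number of workers, since the
# effective batch size grows by the same factor (often paired with a
# warmup over the first few epochs).
scaled_lr = base_lr * hvd.size()  # e.g. 3 GPUs -> 0.03
```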
