Does (horovod + learning rate decay) work properly?
Hi,
I’m using Horovod on 3 GPUs (in a single machine) with learning rate decay (ReduceLROnPlateau). But I found something strange while looking at the printed log: it seems that the learning rate for each process (GPU) is not synchronized.
Actually, this is my first time using Horovod and I don’t know what’s going on inside it. I would appreciate it if you could let me know if I am mistaken!
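For reference, here is a minimal sketch of the kind of setup described, assuming PyTorch + horovod.torch; the model, data, and hyperparameters are placeholders for illustration, and only the scheduler handling matters. Each rank steps ReduceLROnPlateau with its own local loss.

```python
import torch
import torch.nn.functional as F
import horovod.torch as hvd

hvd.init()

# Placeholder model and data; the real setup runs on 3 GPUs with a real model.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=2)

for epoch in range(20):
    x, y = torch.randn(32, 10), torch.randn(32, 1)
    loss = F.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Each rank passes its *local* loss to the scheduler, so ranks can see
    # different values and decide to decay the learning rate at different epochs.
    scheduler.step(loss.item())
    print(f"rank {hvd.rank()} epoch {epoch} "
          f"lr {optimizer.param_groups[0]['lr']:.6f}")
```

Run with e.g. `horovodrun -np 3 python train.py`; because each rank sees different data (and hence a different loss), the printed learning rates can drift apart across ranks, which matches the symptom in the log.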
Top Results From Across the Web
- AdaSum with Horovod: Scaling DNN training to many GPUs always comes at a convergence degradation. This is because with larger batch sizes, gradients are averaged and...
- Why should we scale the learning rate? · Issue #384 - GitHub: The idea is to scale the learning rate linearly with the batch size to preserve the number of epochs needed for the model...
- Distributed Deep Learning with Horovod | NVIDIA: How does Deep Learning training work? ... import horovod.tensorflow as hvd ... Google published a paper “Don't Decay the Learning Rate, Increase the...”
- Why is your Horovod slower than the usual?: This article discusses what can be done to train faster with Horovod and some common bottlenecks that could cause a slow down on...
- Scaling Deep Learning Training - Cray User Group: Training with large learning rates is not stable in the initial stages of... Linear scaling of learning-rate (N * η) followed by...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Hey, thanks for the report, that’s a great catch! Passing avg_loss instead of loss into the scheduler should already fix this! Sorry for the issue. 😃 That way, all processes use the same loss, so the scheduler stays the same across all workers.

Yeah, exactly! Sorry I wasn’t clearer in my description. Thank you!
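A minimal sketch of what that fix looks like inside a loop like the one above, assuming horovod.torch; avg_loss here is just the averaged metric the comment refers to:

```python
# Average the plateau metric across all ranks before stepping the scheduler,
# so every worker makes the same ReduceLROnPlateau decision.
avg_loss = hvd.allreduce(loss.detach(), name="avg_loss")  # averages over ranks by default
scheduler.step(avg_loss.item())
```

Since every rank now feeds the scheduler the same number, the learning-rate schedule stays identical across workers.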