How to use Adam instead of SGD?
See original GitHub issue

All the optimizers are defined as:

optimizer = dict(type='SGD', lr=2e-3, momentum=0.9, weight_decay=5e-4)

But I want to change it to Adam. How should I do that? Can anyone give me an example? Thanks!
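For reference, here is a minimal sketch of the swap, assuming an MMDetection/mmcv-style config in which the optimizer is built from torch.optim by class name; the learning rate and weight decay values below are illustrative, not tuned:

# Original SGD setting from the config.
optimizer = dict(type='SGD', lr=2e-3, momentum=0.9, weight_decay=5e-4)

# Sketch of an Adam replacement: 'Adam' resolves to torch.optim.Adam,
# and the remaining keys are passed to its constructor. momentum is
# dropped because Adam keeps its own moment estimates (betas), and the
# learning rate is lowered since Adam typically needs a much smaller lr
# than SGD.
optimizer = dict(
    type='Adam',
    lr=3e-4,               # illustrative value; tune for your task
    betas=(0.9, 0.999),    # Adam's own momentum-like parameters
    weight_decay=5e-4,
)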
Issue Analytics
- Created: 4 years ago
- Comments: 5
Top Results From Across the Web

A 2021 Guide to improving CNNs-Optimizers: Adam vs SGD
"For now, we could say that fine-tuned Adam is always better than SGD, while there exists a performance gap between Adam and SGD..."

Why not always use the ADAM optimization technique?
"Adam is faster to converge. SGD is slower but generalizes better. So at the end it all depends on your particular circumstances."

Adam vs. SGD: Closing the generalization gap on image ...
"Adam finds solutions that generalize worse than those found by SGD [3, 4, 6]. Even when Adam achieves the same or lower..."

Why Should Adam Optimizer Not Be the Default Learning ...
"To summarize, Adam definitely converges rapidly to a 'sharp minima' whereas SGD is computationally heavy, converges to a 'flat minima' but..."

Deep Learning Optimizers. SGD with momentum, Adagrad ...
"Adam optimizer is by far one of the most preferred optimizers. The idea behind Adam optimizer is to utilize the momentum concept from..."
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
@GYee
When I used Adam instead of SGD, expecting faster training, the loss diverged to over 1000, so I stopped the run. I have not yet obtained a satisfactory training result, but at least training was stable with SGD.
The loss diverged simply because the learning rate of 0.02 is too large for Adam. Try 1e-3 or 3e-4 and you will get reasonable results. However, the results are still much lower than those obtained with SGD (bbox mAP 31.1 and segm mAP 28.6 with lr=3e-4 on Mask R-CNN with a ResNet-50 backbone). We suggest more hyper-parameter tuning if Adam is unavoidable in your experiments.
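As a rough illustration of the same advice outside the config system (plain torch.optim here, not the repository's own training code; the model is a hypothetical stand-in), the swap and the reduced learning rate look like this:

import torch
import torch.nn as nn

# Hypothetical model standing in for the detector; any nn.Module works.
model = nn.Linear(10, 2)

# SGD as in the original setup: larger lr with momentum.
optimizer = torch.optim.SGD(model.parameters(), lr=2e-3,
                            momentum=0.9, weight_decay=5e-4)

# Adam replacement: keep weight decay, drop momentum (Adam maintains its
# own moment estimates), and lower the learning rate to the suggested
# 1e-3 or 3e-4.
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4,
                             weight_decay=5e-4)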