Hyperparameters of AdamW
Issue Description
Table 2 of your paper shows that AdamW on ImageNet is as good as SGDM, which is very exciting. Would you like to share the hyperparameters with us? Thanks!
-
I guess from your paper that ResNet + AdamW is
`AdamW(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.0001, amsgrad=False)`
Is that right? However, I ran an experiment with the above setting and the result was two points lower than yours. I'm confused.
-
What are the hyperparameters for MobileNetV2 + AdamW?
Hi
Thank you for your interest in our paper.
For `torch.optim.AdamW`, you have to use `weight_decay=0.1`. In the AdamW paper, the weight decay is decoupled, which means `w = (1 - weight_decay) * w`, but the PyTorch implementation is `w = (1 - lr * weight_decay) * w` (https://github.com/pytorch/pytorch/blob/b31f58de6fa8bbda5353b3c77d9be4914399724d/torch/optim/adamw.py#L73). This makes it easy to apply a learning rate scheduler to the weight decay as well, but it requires rescaling the parameter. In the paper, we followed the notation of the AdamW paper, so `lr=1e-3, weight_decay=0.1` is the PyTorch setting corresponding to a weight decay of 1e-4 in paper notation. You can find a similar setting in the NovoGrad paper: https://arxiv.org/pdf/1905.11286.pdf
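To make the conversion concrete, here is a minimal sketch; the `nn.Linear` model is only a placeholder standing in for the actual ResNet, and the variable names are illustrative:

```python
import torch
import torch.nn as nn

# AdamW paper notation (decoupled decay):  w <- (1 - wd_paper) * w
# PyTorch torch.optim.AdamW:               w <- (1 - lr * weight_decay) * w
# so the argument to pass is weight_decay = wd_paper / lr.
lr = 1e-3
wd_paper = 1e-4             # weight decay in paper notation
wd_pytorch = wd_paper / lr  # = 0.1, the value passed to torch.optim.AdamW

model = nn.Linear(10, 10)   # placeholder; replace with your ResNet
optimizer = torch.optim.AdamW(model.parameters(), lr=lr,
                              betas=(0.9, 0.999), eps=1e-8,
                              weight_decay=wd_pytorch)
```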
`5e-3` is correct:
`torch.optim.AdamW(param, lr=2e-3, weight_decay=5e-3)`
This is `1e-5` in paper notation.
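As a quick check of the same conversion (again with a placeholder model standing in for MobileNetV2):

```python
import torch
import torch.nn as nn

lr = 2e-3
wd_pytorch = 5e-3           # lr * wd_pytorch = 1e-5, the weight decay in paper notation

model = nn.Linear(10, 10)   # placeholder; replace with your MobileNetV2
optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=wd_pytorch)
```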