
Hyperparameters of AdamW

See original GitHub issue

Table 2 of your paper shows that AdamW on ImageNet is as good as SGDM, which is very exciting. Would you be willing to share the hyperparameters with us? Thanks!

  • I guess from your paper that ResNet + AdamW uses AdamW(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.0001, amsgrad=False). Is that right? However, I ran an experiment with this setting and the result was two points lower than yours. I'm confused.

  • What are the hyperparameters for MobileNetV2 + AdamW?

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 7

Top GitHub Comments

5 reactions
bhheo commented, Jun 15, 2021

Hi

Thank you for your interest in our paper.

For torch.optim.AdamW, you have to use weight_decay=0.1. The AdamW paper decouples the weight decay, i.e. w = (1 - weight_decay) * w, but the PyTorch implementation applies w = (1 - lr * weight_decay) * w (see https://github.com/pytorch/pytorch/blob/b31f58de6fa8bbda5353b3c77d9be4914399724d/torch/optim/adamw.py#L73). This makes it easy to apply a learning rate scheduler to the weight decay as well, but it requires rescaling the parameter.

In the paper we followed the AdamW paper's notation, so with lr=1e-3 the PyTorch parameter weight_decay=0.1 corresponds to a weight decay of 1e-4 in the paper.

You can find a similar setting in the NovoGrad paper: https://arxiv.org/pdf/1905.11286.pdf
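
A minimal sketch of the conversion described in that comment (a hypothetical toy model stands in for ResNet; the only assumption is PyTorch's documented behavior of multiplying weight_decay by the learning rate):

    import torch
    from torch import nn

    model = nn.Linear(10, 2)  # toy model, stand-in for ResNet

    # Paper notation: decoupled weight decay lambda = 1e-4, applied as w <- (1 - lambda) * w.
    # PyTorch applies w <- (1 - lr * weight_decay) * w, so divide by lr to convert.
    paper_wd = 1e-4
    lr = 1e-3
    pytorch_wd = paper_wd / lr  # = 0.1

    optimizer = torch.optim.AdamW(model.parameters(), lr=lr,
                                  betas=(0.9, 0.999), weight_decay=pytorch_wd)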

1 reaction
bhheo commented, Jun 15, 2021

5e-3 is correct: torch.optim.AdamW(param, lr=2e-3, weight_decay=5e-3)

That corresponds to 1e-5 in the paper's notation.
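
A quick arithmetic check of that setting (a sketch, assuming the same lr * weight_decay relationship described in the earlier comment):

    lr = 2e-3
    pytorch_wd = 5e-3
    paper_wd = lr * pytorch_wd  # = 1e-5, matching the paper-notation value above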

Read more comments on GitHub >

Top Results From Across the Web

A Hyperparameters - NIPS papers
Identical hyperparameters are used to train both models (AdamW optimizer, weight decay 0.05 and learning rate 1e-3). We use max input size 512...

Hyperparameter Tuning Analysis – Weights & Biases - WandB
We used W&B Sweeps to do hyperparameter search. ... It's a close competition between AdamW and Adam but seems like Adam is winning...

Optimization - Hugging Face
Implements Adam algorithm with weight decay fix as introduced in Decoupled ... replace AdamW with Adafactor optimizer = Adafactor( model.parameters(), ...

AdamW and Super-convergence is now the fastest way to ...
Adam generally requires more regularization than SGD, so be sure to adjust your regularization hyper-parameters when switching from SGD to ...

In what order should we tune hyperparameters in Neural ...
For the Adam optimizer: "Recommended values in the paper are eps = 1e-8, beta1 = 0.9, beta2 = 0.999." (source); For the learning...
