Hyperparameters of AdamW
Issue Description
Table 2 of your paper shows that AdamW on ImageNet is as good as SGDM, which is very exciting. Would you like to share the hyperparameters with us? Thanks!
-
I guess from your paper that ResNet + AdamW is
`AdamW(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.0001, amsgrad=False)`
Is that right? However, I ran an experiment with the above setting and the result was two points lower than yours. I'm confused.
-
What are the hyperparameters for MobileNetV2 + AdamW?
Hi
Thank you for your interest in our paper.
For `torch.optim.AdamW`, you have to use `weight_decay=0.1`. In the AdamW paper, the weight decay is decoupled, which means `w = (1 - weight_decay) * w`, but the PyTorch implementation is `w = (1 - lr * weight_decay) * w` (https://github.com/pytorch/pytorch/blob/b31f58de6fa8bbda5353b3c77d9be4914399724d/torch/optim/adamw.py#L73). This makes it easy to apply a learning rate scheduler to the weight decay as well, but it requires rescaling the parameter. In the paper, we followed the notation of the AdamW paper, so `lr=1e-3, weight_decay=0.1` is the PyTorch setting corresponding to a weight decay of 1e-4 in paper notation. You can find a similar setting in the NovoGrad paper: https://arxiv.org/pdf/1905.11286.pdf
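To make the conversion concrete, here is a minimal sketch; the `nn.Linear` model is only a placeholder standing in for the actual ResNet, and the variable names are illustrative:

```python
import torch
import torch.nn as nn

# AdamW paper notation (decoupled decay):  w <- (1 - wd_paper) * w
# PyTorch torch.optim.AdamW:               w <- (1 - lr * weight_decay) * w
# so the argument to pass is weight_decay = wd_paper / lr.
lr = 1e-3
wd_paper = 1e-4             # weight decay in paper notation
wd_pytorch = wd_paper / lr  # = 0.1, the value passed to torch.optim.AdamW

model = nn.Linear(10, 10)   # placeholder; replace with your ResNet
optimizer = torch.optim.AdamW(model.parameters(), lr=lr,
                              betas=(0.9, 0.999), eps=1e-8,
                              weight_decay=wd_pytorch)
```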
`5e-3` is correct:
`torch.optim.AdamW(param, lr=2e-3, weight_decay=5e-3)`
This is `1e-5` in paper notation.
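As a quick check of the same conversion (again with a placeholder model standing in for MobileNetV2):

```python
import torch
import torch.nn as nn

lr = 2e-3
wd_pytorch = 5e-3           # lr * wd_pytorch = 1e-5, the weight decay in paper notation

model = nn.Linear(10, 10)   # placeholder; replace with your MobileNetV2
optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=wd_pytorch)
```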