
Why not AdamW-style weight decay?

See original GitHub issue

Hello,

While translating your optimizer to Flax (here), I noticed that you are using traditional weight decay, where you add the weight decay term to the gradient (here in your implementation):

grad += weight_decay * parameters

Rather than an AdamW-style weight decay (which, I believe, is now the default for most optimizers), where you would subtract the weight decay times the learning rate just before returning the parameters:

updated_parameters -= learning_rate * weight_decay * parameters
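
To make the contrast concrete, here is a minimal sketch of the two update rules in plain Python, with an abstract precond function standing in for whatever adaptive rescaling the optimizer applies; the names are illustrative, not taken from the actual code:

def step_coupled(param, grad, precond, lr, weight_decay):
    # Coupled ("L2") decay: the decay term is folded into the gradient,
    # so it also passes through the optimizer's preconditioning
    # (momentum, Adam's second-moment scaling, ...).
    grad = grad + weight_decay * param
    return param - lr * precond(grad)

def step_decoupled(param, grad, precond, lr, weight_decay):
    # Decoupled (AdamW-style) decay: the parameter is shrunk directly,
    # independently of the gradient path.
    updated = param - lr * precond(grad)
    return updated - lr * weight_decay * param

# With an identity preconditioner (plain SGD) the two rules coincide;
# they only diverge once the gradient is rescaled, as in Adam.
identity = lambda g: g
print(step_coupled(1.0, 0.5, identity, lr=0.1, weight_decay=0.01))
print(step_decoupled(1.0, 0.5, identity, lr=0.1, weight_decay=0.01))
# both print the same value (about 0.949)

With momentum or per-parameter adaptive scaling inside precond, the coupled version rescales the decay term along with the gradient, while the decoupled version always shrinks each weight by exactly learning_rate * weight_decay.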

Is there a particular reason for that decision?

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 5
  • Comments: 16 (8 by maintainers)

Top GitHub Comments

3 reactions
adefazio commented, Apr 6, 2021

I will look into adding the AdamW-style weight decay as an option, thanks for the discussion and results!

2 reactions
adefazio commented, Feb 7, 2022

I’m going to add AdamW-style averaging to the implementation this week, as it seems popular based on the comments here.
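
Since the question came up while porting the optimizer to Flax, it may help to see how the two decay styles are usually expressed in the JAX/Optax ecosystem: the only difference is where the weight-decay transform sits relative to the adaptive scaling. This is a generic sketch of the standard Optax pattern, not the code of the optimizer discussed in this issue:

import optax

learning_rate = 1e-3
weight_decay = 1e-4

# Coupled (L2-style): decay is added to the gradient before the adaptive
# scaling, so it gets rescaled together with the gradient.
l2_style = optax.chain(
    optax.add_decayed_weights(weight_decay),
    optax.scale_by_adam(),
    optax.scale(-learning_rate),
)

# Decoupled (AdamW-style): decay is applied after the adaptive scaling,
# so each parameter shrinks by learning_rate * weight_decay * param
# regardless of its gradient statistics.
adamw_style = optax.chain(
    optax.scale_by_adam(),
    optax.add_decayed_weights(weight_decay),
    optax.scale(-learning_rate),
)

The second chain is essentially how optax.adamw is put together.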

Read more comments on GitHub.

Top Results From Across the Web

Understanding L2 regularization, Weight decay and AdamW
A post explaining L2 regularization, weight decay and the AdamW optimizer as described in the paper Decoupled Weight Decay Regularization ...

AdamW and Adam with weight decay - pytorch - Stack Overflow
Yes, Adam and AdamW weight decay are different. Loshchilov and Hutter pointed out in their paper (Decoupled Weight Decay Regularization) that the way ...

Why AdamW matters. Adaptive optimizers like Adam have…
The idea behind L2 regularization or weight decay is that networks with smaller weights (all other things being equal) are observed to overfit ...

Stable Weight Decay Regularization | OpenReview
Simply fixing weight decay in Adam by SWD, with no extra hyperparameter, can outperform complex Adam variants, which have more hyperparameters.

[1711.05101] Decoupled Weight Decay Regularization - arXiv
Our proposed decoupled weight decay has already been adopted by many researchers, and the community has implemented it in TensorFlow and ...
