why not AdamW style weight decay
See original GitHub issue
Hello,
While translating your optimizer to Flax (here), I noticed that you use a traditional weight decay, where the weight decay is added to the gradient (here in your implementation):
grad += weight_decay * parameters
rather than an AdamW-style weight decay (which, I believe, is now the default for most optimizers), where you would subtract the weight decay times the learning rate just before returning the parameters:
updated_parameters -= learning_rate * weight_decay * parameters
Is there a particular reason for that decision?
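For context, here is a minimal sketch of the difference, in plain Python/NumPy. The function name adam_step and its state handling are hypothetical illustrations (not this library's actual code); it only shows where the two decay styles enter an Adam-like update.

import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=1e-2, decoupled=False):
    """One Adam-like update, switching between L2-style and decoupled weight decay."""
    if not decoupled:
        # Traditional L2-style decay: folded into the gradient, so it is later
        # rescaled by Adam's adaptive denominator.
        grad = grad + weight_decay * param

    # Standard Adam moment estimates with bias correction (t starts at 1).
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

    prev_param = param  # decoupled decay is applied to the pre-update parameters
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)

    if decoupled:
        # AdamW-style decay: applied directly to the parameters, scaled only by
        # the learning rate and independent of the adaptive statistics.
        param = param - lr * weight_decay * prev_param

    return param, m, v

The practical difference is that in the decoupled form the decay is not divided by sqrt(v_hat), so parameters with large gradient variance are decayed just as strongly as the rest, which is the argument made in Decoupled Weight Decay Regularization.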
Issue Analytics
- State:
- Created 2 years ago
- Reactions: 5
- Comments: 16 (8 by maintainers)
Top Results From Across the Web
Understanding L2 regularization, Weight decay and AdamW
A post explaining L2 regularization, weight decay and the AdamW optimizer as described in the paper Decoupled Weight Decay Regularization ...
AdamW and Adam with weight decay - pytorch - Stack Overflow
Yes, Adam and AdamW weight decay are different. Hutter pointed out in their paper (Decoupled Weight Decay Regularization) that the way ...
Why AdamW matters. Adaptive optimizers like Adam have…
The idea behind L2 regularization or weight decay is that networks with smaller weights (all other things being equal) are observed to overfit ...
Stable Weight Decay Regularization | OpenReview
Simply fixing weight decay in Adam by SWD, with no extra hyperparameter, can outperform complex Adam variants, which have more hyperparameters.
[1711.05101] Decoupled Weight Decay Regularization - arXiv
Our proposed decoupled weight decay has already been adopted by many researchers, and the community has implemented it in TensorFlow and ...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I will look into adding the AdamW-style weight decay as an option, thanks for the discussion and results!
I’m going to add AdamW-style averaging to the implementation this week, as it seems popular based on the comments here.