
Why not AdamW-style weight decay?

See original GitHub issue

Hello,

While translating your optimizer to Flax (here), I noticed that you are using traditional weight decay, where you add the weight decay term to the gradient (here in your implementation):

grad += weight_decay * parameters

Rather than an AdamW-style weight decay (which, I believe, is now the default for most optimizers), where you would subtract the weight decay times the learning rate just before returning the parameters:

updated_parameters -= learning_rate * weight_decay * parameters
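
To make the contrast concrete, here is a minimal sketch of the two update rules in plain Python, with an abstract precond function standing in for whatever adaptive rescaling the optimizer applies; the names are illustrative, not taken from the actual code:

def step_coupled(param, grad, precond, lr, weight_decay):
    # Coupled ("L2") decay: the decay term is folded into the gradient,
    # so it also passes through the optimizer's preconditioning
    # (momentum, Adam's second-moment scaling, ...).
    grad = grad + weight_decay * param
    return param - lr * precond(grad)

def step_decoupled(param, grad, precond, lr, weight_decay):
    # Decoupled (AdamW-style) decay: the parameter is shrunk directly,
    # independently of the gradient path.
    updated = param - lr * precond(grad)
    return updated - lr * weight_decay * param

# With an identity preconditioner (plain SGD) the two rules coincide;
# they only diverge once the gradient is rescaled, as in Adam.
identity = lambda g: g
print(step_coupled(1.0, 0.5, identity, lr=0.1, weight_decay=0.01))
print(step_decoupled(1.0, 0.5, identity, lr=0.1, weight_decay=0.01))
# both print the same value (about 0.949)

With momentum or per-parameter adaptive scaling inside precond, the coupled version rescales the decay term along with the gradient, while the decoupled version always shrinks each weight by exactly learning_rate * weight_decay.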

Is there a particular reason for that decision?

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 5
  • Comments: 16 (8 by maintainers)

Top GitHub Comments

3 reactions
adefazio commented, Apr 6, 2021

I will look into adding the AdamW-style weight decay as an option, thanks for the discussion and results!

2 reactions
adefazio commented, Feb 7, 2022

I’m going to add AdamW-style averaging to the implementation this week, as it seems popular based on the comments here.
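
Since the question came up while porting the optimizer to Flax, it may help to see how the two decay styles are usually expressed in the JAX/Optax ecosystem: the only difference is where the weight-decay transform sits relative to the adaptive scaling. This is a generic sketch of the standard Optax pattern, not the code of the optimizer discussed in this issue:

import optax

learning_rate = 1e-3
weight_decay = 1e-4

# Coupled (L2-style): decay is added to the gradient before the adaptive
# scaling, so it gets rescaled together with the gradient.
l2_style = optax.chain(
    optax.add_decayed_weights(weight_decay),
    optax.scale_by_adam(),
    optax.scale(-learning_rate),
)

# Decoupled (AdamW-style): decay is applied after the adaptive scaling,
# so each parameter shrinks by learning_rate * weight_decay * param
# regardless of its gradient statistics.
adamw_style = optax.chain(
    optax.scale_by_adam(),
    optax.add_decayed_weights(weight_decay),
    optax.scale(-learning_rate),
)

The second chain is essentially how optax.adamw is put together.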

Read more comments on GitHub.

Top Results From Across the Web

Understanding L2 regularization, Weight decay and AdamW
A post explaining L2 regularization, weight decay and the AdamW optimizer as described in the paper Decoupled Weight Decay Regularization ...

AdamW and Adam with weight decay - pytorch - Stack Overflow
Yes, Adam and AdamW weight decay are different. Loshchilov and Hutter pointed out in their paper (Decoupled Weight Decay Regularization) that the way ...

Why AdamW matters. Adaptive optimizers like Adam have…
The idea behind L2 regularization or weight decay is that networks with smaller weights (all other things being equal) are observed to overfit ...

Stable Weight Decay Regularization | OpenReview
Simply fixing weight decay in Adam by SWD, with no extra hyperparameter, can outperform complex Adam variants, which have more hyperparameters.

[1711.05101] Decoupled Weight Decay Regularization - arXiv
Our proposed decoupled weight decay has already been adopted by many researchers, and the community has implemented it in TensorFlow and ...
