
Two bugs in AdamW

Environment info

  • transformers version: 4.13.0.dev0
  • Platform: Linux-3.10.0-1160.45.1.el7.x86_64-x86_64-with-glibc2.17
  • Python version: 3.9.7
  • PyTorch version (GPU?): 1.10.0+cu113 (True)
  • Tensorflow version (GPU?): 2.7.0 (False)
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help

Based on git blame, @thomwolf and @stas00 should be able to help.

Information

There are two bugs in the implementation of AdamW.

Here’s the current code: https://github.com/manuelciosici/transformers/blob/04683c0659aacf31a1e1df8aa2e6cf7b447a6f12/src/transformers/optimization.py#L324-L371

Weight decay bug

Look at lines 369-370. The weight decay is multiplied with p.data, which no longer corresponds to theta_{t-1}, since p.data has already been modified by the Adam update a few lines above. Below is Algorithm 2 from the original AdamW paper; its line 12 shows that the weight decay should be multiplied with the previous step’s parameters (i.e., theta_{t-1}).

[Figure: Algorithm 2 (Adam with decoupled weight decay) from Loshchilov & Hutter, “Decoupled Weight Decay Regularization”]
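
For reference, here is my transcription of the update step (line 12 of Algorithm 2); the notation follows the paper, with eta_t the schedule multiplier, alpha the learning rate, and lambda the weight decay:

```latex
\theta_t \leftarrow \theta_{t-1} - \eta_t \left( \frac{\alpha \, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \, \theta_{t-1} \right)
```

Note that the weight decay term \lambda \theta_{t-1} uses the previous step’s parameters.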

From what I can tell, this is a regression, since the original AdamW implementation in transformers applied weight decay properly. Here’s the commit that introduced the bug: https://github.com/HuggingFace/transformers/commit/ec07cf5a660926833d6f5208b58730e4af8d1178#diff-40c6163602943c11431f1ec360299a7646bb436c691a646b9f54b2284f556ce0

For confirmation that weight decay is currently buggy, see the original AdamW implementation, where the weight decay (line 74) is multiplied with the old parameters, as opposed to the new parameters computed on line 71.
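
To make the ordering concrete, here is a minimal runnable sketch (not a verbatim excerpt; the tensors and hyperparameters are made up, and only the order of operations matters):

```python
import torch

lr, weight_decay, step_size = 1e-3, 0.01, 1e-3
p = torch.tensor([1.0, -2.0])        # theta_{t-1}
exp_avg = torch.tensor([0.1, 0.2])   # first moment m_t
denom = torch.tensor([0.5, 0.4])     # sqrt(v_t) + eps

# Buggy order (as in the current code): the Adam update runs first,
# so the decay multiplies theta_t instead of theta_{t-1}.
p_buggy = p.clone()
p_buggy.addcdiv_(exp_avg, denom, value=-step_size)  # p_buggy is now theta_t
p_buggy.add_(p_buggy, alpha=-lr * weight_decay)     # decays theta_t

# Order matching Algorithm 2: decay theta_{t-1}, then apply the update.
p_fixed = p.clone()
p_fixed.add_(p_fixed, alpha=-lr * weight_decay)     # decays theta_{t-1}
p_fixed.addcdiv_(exp_avg, denom, value=-step_size)

print(p_buggy - p_fixed)  # small but nonzero difference
```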

Denominator computation bug

The second bug appears in the computation of the denominator corresponding to line 10 in Algorithm 2 above. In the current code (see the link in the Information section), on line 351, the denominator omits the division by math.sqrt(bias_correction2). On line 357, the division by math.sqrt(bias_correction2) does appear, but by that point eps has already been added to denom, so the resulting update is not equivalent to line 10 of Algorithm 2.

From what I can tell, this bug was also introduced as part of commit https://github.com/HuggingFace/transformers/commit/ec07cf5a660926833d6f5208b58730e4af8d1178#diff-40c6163602943c11431f1ec360299a7646bb436c691a646b9f54b2284f556ce0. The previous implementation, update = next_m / (next_v.sqrt() + group['e']), was correct.

For confirmation that the denominator is not properly calculated, see the original AdamW implementation, where the denominator is computed on line 64.
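
Again, a minimal sketch with made-up values, showing why the order of adding eps matters (the sketch writes the division by math.sqrt(bias_correction2) out directly; in the library code the same rescaling happens through the step size, if I read it correctly):

```python
import math
import torch

beta2, eps, step = 0.999, 1e-6, 10
bias_correction2 = 1.0 - beta2 ** step
exp_avg_sq = torch.tensor([0.04, 0.09])  # hypothetical second moment v_t

# Buggy order: eps is added first, so the later division by
# math.sqrt(bias_correction2) rescales eps along with sqrt(v_t).
denom_buggy = (exp_avg_sq.sqrt() + eps) / math.sqrt(bias_correction2)

# Order matching line 10 of Algorithm 2 (and torch.optim.AdamW):
# bias-correct the second moment first, then add eps.
denom_fixed = exp_avg_sq.sqrt() / math.sqrt(bias_correction2) + eps

print(denom_buggy - denom_fixed)  # nonzero: eps / sqrt(bc2) != eps
```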

To reproduce

Steps to reproduce the behavior:

  1. Check out the branch at https://github.com/manuelciosici/transformers/tree/reveal_broken_adamw
  2. Run the unit tests in tests/test_optimization.py
  3. Tests test_compare_adamw_no_weight_decay and test_compare_adamw_with_weight_decay should fail (see the attached failed_tests.txt and the sketch below)
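
Roughly, the comparison tests drive both optimizers with identical gradients and check that the parameters stay in sync. The sketch below is my approximation of that comparison, not the actual test code; eps is set explicitly because the two implementations have different defaults:

```python
import torch
from torch.optim import AdamW as TorchAdamW
from transformers.optimization import AdamW as HFAdamW

torch.manual_seed(0)
w_ref = torch.nn.Parameter(torch.randn(10))
w_hf = torch.nn.Parameter(w_ref.detach().clone())

opt_ref = TorchAdamW([w_ref], lr=1e-3, eps=1e-6, weight_decay=0.01)
opt_hf = HFAdamW([w_hf], lr=1e-3, eps=1e-6, weight_decay=0.01, correct_bias=True)

for _ in range(100):
    grad = torch.randn(10)
    w_ref.grad = grad.clone()
    w_hf.grad = grad.clone()
    opt_ref.step()
    opt_hf.step()

# With the two bugs present, the trajectories drift apart.
print(torch.max(torch.abs(w_ref.detach() - w_hf.detach())))
```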

Expected behavior

The two implementations of AdamW should produce matching parameter updates.

Proposed fix

Check out the branch at https://github.com/manuelciosici/transformers/tree/fix_adamw. It contains both the unit tests above and a fix for the two bugs described above.

I can make a PR once we agree on the two bugs and the fix.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 16 (15 by maintainers)

Top GitHub Comments

2 reactions
stas00 commented, Dec 9, 2021

The NVIDIA engineers have been profiling a few things and found that torch’s AdamW is faster than ours (apparently apex’s is even faster), so I will add this to the performance docs once I’m able to benchmark it when your PR is ready, @manuelciosici.

https://github.com/huggingface/transformers/pull/14708

2 reactions
manuelciosici commented, Dec 3, 2021

@stas00 Thank you. I will work on this during the weekend.
