
[BUG] Post Layer Normalization model gradients not matching


Describe the bug
Test inputs in test_cuda_backward.py that have is_preln = False often fail, but the same inputs with is_preln = True pass. I can see that in the file all of the test cases with post layer normalization are commented out. Is post layer normalization no longer supported, or is the test case unable to properly test models with post layer normalization?
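For context, pre-layer normalization and post-layer normalization differ only in where LayerNorm sits relative to the residual connection: pre-LN normalizes the input to each sub-layer, while post-LN normalizes the residual sum. The following is a minimal PyTorch sketch of that ordering for illustration only; it is not DeepSpeed's fused transformer kernel, and the attn/ffn sub-layers are generic stand-ins.

import torch.nn as nn

class Block(nn.Module):
    # Minimal illustration of pre-LN vs. post-LN ordering
    # (not DeepSpeed's fused transformer kernel).
    def __init__(self, hidden, heads, is_preln=True):
        super().__init__()
        self.is_preln = is_preln
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(),
                                 nn.Linear(4 * hidden, hidden))
        self.ln1 = nn.LayerNorm(hidden)
        self.ln2 = nn.LayerNorm(hidden)

    def forward(self, x):
        if self.is_preln:
            # Pre-LN: normalize before each sub-layer, add the residual after.
            h = self.ln1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]
            x = x + self.ffn(self.ln2(x))
        else:
            # Post-LN: add the residual first, then normalize the sum.
            x = self.ln1(x + self.attn(x, x, x, need_weights=False)[0])
            x = self.ln2(x + self.ffn(x))
        return x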

To Reproduce
Steps to reproduce the behavior:

  1. Go to test_cuda_backward.py.
  2. Add the following test cases on line 260 (see the sketch after this list for where they fit):
     (64, 1600, 128, 2, 24, True, True, 0.2),
     (64, 1600, 128, 2, 24, False, True, 0.2),
  3. Run test_cuda_backward.py.
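For reference, here is a sketch of where those tuples presumably plug in. The field names below are assumed from the tuple shape (batch size, hidden size, sequence length, heads, layers, is_preln, fp16, tolerance); confirm the exact order against the real @pytest.mark.parametrize decorator in test_cuda_backward.py.

import pytest

# Hypothetical excerpt; field names are assumed, not copied from the test file.
@pytest.mark.parametrize(
    'batch_size, hidden_size, seq_len, heads, num_layers, is_preln, use_fp16, atol',
    [
        (64, 1600, 128, 2, 24, True, True, 0.2),   # pre-LN: reported to pass
        (64, 1600, 128, 2, 24, False, True, 0.2),  # post-LN: reported to fail
    ])
def test_backward(batch_size, hidden_size, seq_len, heads, num_layers,
                  is_preln, use_fp16, atol):
    ...  # body as in the existing test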

Expected behavior
Both test cases should pass; however, only the test case that uses pre-layer normalization does.
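In this kind of kernel test, "not matching" usually means the gradients from the custom CUDA layer differ from those of a reference PyTorch implementation by more than the given tolerance. A minimal sketch of such a check, assuming a simple parameter-by-parameter comparison (the actual test uses its own helper):

import torch

def grads_close(custom_params, reference_params, atol=0.2, rtol=0.0):
    # Assumed helper for illustration; compares gradients element-wise
    # after both models have run backward on the same input and loss.
    for p_custom, p_ref in zip(custom_params, reference_params):
        if not torch.allclose(p_custom.grad.float(), p_ref.grad.float(),
                              atol=atol, rtol=rtol):
            return False
    return True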

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/peter/anaconda3/lib/python3.8/site-packages/torch']
torch version .................... 1.9.1+cu111
torch cuda version ............... 11.1
nvcc version ..................... 11.1
deepspeed install path ........... ['/home/peter/anaconda3/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.5.4, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.9, cuda 11.1

System info (please complete the following information):

  • OS: Ubuntu 20.04
  • GPU count: 1 X RTX 6000
  • Python version: Python 3.8

Launcher context
Just running the command locally.

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 9 (5 by maintainers)

Top GitHub Comments

1 reaction
PeterDykas commented, Oct 29, 2021

Awesome thank you!

1 reaction
RezaYazdaniAminabadi commented, Oct 27, 2021

Hi @PeterDykas ,

Thanks for pointing out this issue. I will look into this and send a PR to fix it soon.

Thanks, Reza


Top Results From Across the Web

  • Untrainable dense layer in TFBert. "WARNING:tensorflow ..."
    Bug I am getting WARNING:tensorflow:Gradients do not exist for variables ['tf_bert_model/bert/pooler/dense/kernel:0', ...
  • 37 Reasons why your Neural Network is not working - Slav
    The network had been training for the last 12 hours. It all looked good: the gradients were flowing and the loss was decreasing...
  • Proxy-Normalizing Activations to Match Batch ... - OpenReview
    We introduce a batch-independent normalization that consistently matches batch normalization in both behavior and performance.
  • On Layer Normalization in the Transformer Architecture - arXiv
    On Theorem 1: Theorem 1 suggests that for any sizes of the Post-LN Transformer, the scale of the gradient norm in the last...
  • What should I do when my neural network doesn't learn?
    Neural networks are not "off-the-shelf" algorithms in the way that random forest or logistic ... Batch or Layer normalization can improve network training...
