
Question about the normalization layer in the Decoder of the Transformer

See original GitHub issue

Hi, first of all, thanks for the great repo.

The output of the TransformerDecoderLayer is already normalized when the parameter normalize_before is set to False (post-norm). Why do we need to normalize the output of TransformerDecoder again at lines 113 and 116? (decoder_norm is always passed to the TransformerDecoder.) Is this intended? Or should decoder_norm be handled the same way as encoder_norm, which depends on normalize_before? Any explanation would be appreciated.
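For context, here is a minimal sketch of the structure the question describes; the class and argument names follow the question and are not the repo's verbatim source, and exact signatures may differ. The point it illustrates: with normalize_before=False each decoder layer already ends in a LayerNorm (post-norm), and the wrapper then applies decoder_norm on top, so the final output passes through two LayerNorms back to back.

```python
# Hypothetical sketch of the pattern described above, not the repo's exact code.
import copy
import torch
import torch.nn as nn

class TransformerDecoder(nn.Module):
    def __init__(self, decoder_layer, num_layers, norm=None):
        super().__init__()
        self.layers = nn.ModuleList([copy.deepcopy(decoder_layer) for _ in range(num_layers)])
        self.norm = norm  # decoder_norm is always passed in, so this is never None

    def forward(self, tgt, memory):
        output = tgt
        for layer in self.layers:
            # With normalize_before=False, each layer already ends with its own
            # LayerNorm (post-norm), so `output` is normalized at this point.
            output = layer(output, memory)
        if self.norm is not None:
            # The normalization the question asks about: a second LayerNorm
            # applied right after the last layer's LayerNorm.
            output = self.norm(output)
        return output

# Usage, with torch's built-in post-norm decoder layer as a stand-in:
layer = nn.TransformerDecoderLayer(d_model=256, nhead=8)   # post-norm by default
decoder = TransformerDecoder(layer, num_layers=6, norm=nn.LayerNorm(256))
out = decoder(torch.randn(10, 2, 256), torch.randn(20, 2, 256))  # (tgt_len, batch, d_model)
```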

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 3
  • Comments: 5 (1 by maintainers)

Top GitHub Comments

2 reactions
turboxin commented, Sep 18, 2020

https://github.com/pytorch/pytorch/issues/24930 explains that applying LayerNorm to an already layer-normalized input gives essentially the same result, i.e. LayerNorm(LayerNorm(x)) ≈ LayerNorm(x), so the duplicated normalization is OK.
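A quick numerical check of that claim (a sketch using a freshly initialized nn.LayerNorm, whose affine parameters are still weight = 1 and bias = 0; with separately trained affine parameters the two outputs would no longer be literally identical):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 256
ln = nn.LayerNorm(d_model)   # default init: weight = 1, bias = 0

x = torch.randn(10, 4, d_model)
once = ln(x)                 # e.g. the output of a post-norm decoder layer
twice = ln(ln(x))            # the extra decoder_norm applied on top

# An already-normalized vector has zero mean and ~unit variance per feature
# dimension, so the second LayerNorm is numerically almost a no-op.
print(torch.allclose(once, twice, atol=1e-5))   # True
print((once - twice).abs().max())               # tiny residual (~1e-6)
```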

2 reactions
chrisway613 commented, Jul 19, 2020

I've noticed this question too! If normalize_before is set to False, it does a duplicated normalization.

Read more comments on GitHub >

Top Results From Across the Web

  • What is the purpose of Add & Norm layers in Transformers?
    It is one of the solutions for the vanishing gradient problem. The norm step is about layer normalization (Ba et al., 2016), it is...
  • Layer Normalization Explained | Papers With Code
    More recently, it has been used with Transformer models. We compute the layer normalization statistics over all the hidden units in the same...
  • The Transformers in NLP - Medium
    The first layer of the decoder is masked multi-headed attention followed by add & normalization. Masked multi-headed attention here takes an output embedding...
  • On Layer Normalization in the Transformer Architecture - arXiv
    Therefore, using a large learning rate on those gradients makes the training unstable. The warm-up stage is practically helpful for avoiding...
  • Illustrated Guide to Transformers - Step by Step Explanation
    The decoder has similar sub-layers to the encoder: it has two multi-headed attention layers, a pointwise feed-forward layer, and residual...
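For reference, the layer-normalization statistics mentioned in the excerpts above are computed per example over the H hidden units a_1^l, ..., a_H^l of a layer l (Ba et al., 2016); the notation below is the standard one from that paper, not quoted from any of the linked pages:

```latex
% Layer normalization for one example: statistics over the H hidden units of layer l.
\mu^{l} = \frac{1}{H} \sum_{i=1}^{H} a_i^{l},
\qquad
\sigma^{l} = \sqrt{\frac{1}{H} \sum_{i=1}^{H} \bigl(a_i^{l} - \mu^{l}\bigr)^{2}},
\qquad
\hat{a}_i^{l} = \frac{a_i^{l} - \mu^{l}}{\sigma^{l}}
% Implementations such as nn.LayerNorm add a small epsilon under the square
% root and apply a learned affine transform g \odot \hat{a}^{l} + b afterwards.
```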
