Question about the normalization layer in the Decoder of the Transformer
See original GitHub issue

Hi, first of all, thanks for the great repo.

The output of the TransformerDecoderLayer is already normalized when the parameter normalize_before is set to False (post-norm). Why do we need to normalize the output of TransformerDecoder again in line 113 and line 116? (decoder_norm is always passed to the TransformerDecoder.) Is this intended, or should decoder_norm be built the same way as encoder_norm, which depends on normalize_before? Any explanation would be appreciated.
Issue Analytics
- State:
- Created: 3 years ago
- Reactions: 3
- Comments: 5 (1 by maintainers)
Top Results From Across the Web

What is the purpose of Add & Norm layers in Transformers?
It is one of the solutions for the vanishing gradient problem. The norm step is about layer normalization (Ba et al., 2016), it is...

Layer Normalization Explained | Papers With Code
More recently, it has been used with Transformer models. We compute the layer normalization statistics over all the hidden units in the same... (a short PyTorch sketch of this computation follows the list)

The Transformers in NLP - Medium
The first layer of the decoder is masked multi-headed attention followed by add & normalization. Masked multi-headed attention here takes an output embedding...

On Layer Normalization in the Transformer Architecture - arXiv
Therefore, using a large learning rate on those gradients makes the training unstable. The warm-up stage is practically helpful for avoiding...

Illustrated Guide to Transformers - Step by Step Explanation
The decoder has a similar sub-layer as the encoder. It has two multi-headed attention layers, a pointwise feed-forward layer, and residual...
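To make the "statistics over all the hidden units" point from the Papers With Code excerpt concrete, here is a small self-contained check in plain PyTorch (not code from the repo under discussion): the per-token mean and variance are taken over the hidden dimension, and nn.LayerNorm with its default initialization reproduces the manual computation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 4, 8)  # (batch, sequence, hidden)

# Manual layer norm: statistics are computed per token, over the hidden units.
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
manual = (x - mean) / torch.sqrt(var + 1e-5)

# nn.LayerNorm with default parameters (weight=1, bias=0, eps=1e-5) matches.
ln = nn.LayerNorm(8)
print(torch.allclose(ln(x), manual, atol=1e-6))  # True
```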
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
https://github.com/pytorch/pytorch/issues/24930 explains that nn.LayerNorm(nn.LayerNorm(x)) is equivalent to nn.LayerNorm(x), so the duplicated normalization is harmless.
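A quick way to sanity-check that claim in isolation (plain PyTorch, not the repo's code): with LayerNorm's default initialization (weight = 1, bias = 0), normalizing an already-normalized tensor changes it only by numerical noise, which is the sense in which the stacked normalization is redundant rather than harmful.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
norm = nn.LayerNorm(512)  # default init: weight = 1, bias = 0
x = torch.randn(10, 512)

once = norm(x)      # e.g. the post-norm output of the last decoder layer
twice = norm(once)  # the extra decoder_norm applied on top of it

print(torch.allclose(once, twice, atol=1e-5))  # True
print((once - twice).abs().max().item())       # tiny numerical difference
```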
I've found this question too! If normalize_before is set to False, the model does a duplicated normalization.
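If one did want the two norms to follow the same convention, here is a hedged sketch of the change the question hints at; the variable names come from the issue, while d_model and the surrounding constructor context are assumptions for illustration:

```python
import torch.nn as nn

d_model = 512
normalize_before = False  # post-norm layers, as in the configuration discussed above

# encoder_norm is only created for pre-norm layers; the suggestion is to build
# decoder_norm the same way instead of creating it unconditionally.
encoder_norm = nn.LayerNorm(d_model) if normalize_before else None
decoder_norm = nn.LayerNorm(d_model) if normalize_before else None
```

Whether that change is worth making is a separate question; per the comment above, the extra normalization is redundant but not harmful.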