Question about the normalization layer in the Decoder of the Transformer
See original GitHub issue

Hi, first of all, thanks for the great repo.

The output of the TransformerDecoderLayer is already normalized when the parameter normalize_before is set to False (post-norm). Why do we need to normalize the output of TransformerDecoder again in line 113 and line 116? (decoder_norm is always passed to the TransformerDecoder.) Is this intended, or should decoder_norm be built the same way as encoder_norm, which depends on normalize_before? Any explanation would be appreciated.
Issue Analytics
- State:
- Created: 3 years ago
- Reactions: 3
- Comments: 5 (1 by maintainers)
Top Results From Across the Web

What is the purpose of Add & Norm layers in Transformers?
It is one of the solutions for the vanishing gradient problem. The norm step is about layer normalization (Ba et al., 2016), it is...

Layer Normalization Explained | Papers With Code
More recently, it has been used with Transformer models. We compute the layer normalization statistics over all the hidden units in the same... (a short PyTorch sketch of this computation follows the list)

The Transformers in NLP - Medium
The first layer of the decoder is masked multi-headed attention followed by add & normalization. Masked multi-headed attention here takes an output embedding...

On Layer Normalization in the Transformer Architecture - arXiv
Therefore, using a large learning rate on those gradients makes the training unstable. The warm-up stage is practically helpful for avoiding...

Illustrated Guide to Transformers - Step by Step Explanation
The decoder has a similar sub-layer as the encoder. It has two multi-headed attention layers, a pointwise feed-forward layer, and residual...
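To make the "statistics over all the hidden units" point from the Papers With Code excerpt concrete, here is a small self-contained check in plain PyTorch (not code from the repo under discussion): the per-token mean and variance are taken over the hidden dimension, and nn.LayerNorm with its default initialization reproduces the manual computation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 4, 8)  # (batch, sequence, hidden)

# Manual layer norm: statistics are computed per token, over the hidden units.
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
manual = (x - mean) / torch.sqrt(var + 1e-5)

# nn.LayerNorm with default parameters (weight=1, bias=0, eps=1e-5) matches.
ln = nn.LayerNorm(8)
print(torch.allclose(ln(x), manual, atol=1e-6))  # True
```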
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
https://github.com/pytorch/pytorch/issues/24930 explains that nn.LayerNorm(nn.LayerNorm(x)) is equivalent to nn.LayerNorm(x), so the duplicated normalization is harmless.
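A quick way to sanity-check that claim in isolation (plain PyTorch, not the repo's code): with LayerNorm's default initialization (weight = 1, bias = 0), normalizing an already-normalized tensor changes it only by numerical noise, which is the sense in which the stacked normalization is redundant rather than harmful.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
norm = nn.LayerNorm(512)  # default init: weight = 1, bias = 0
x = torch.randn(10, 512)

once = norm(x)      # e.g. the post-norm output of the last decoder layer
twice = norm(once)  # the extra decoder_norm applied on top of it

print(torch.allclose(once, twice, atol=1e-5))  # True
print((once - twice).abs().max().item())       # tiny numerical difference
```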
I've found this question too! If normalize_before is set to False, the model does a duplicated normalization.
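If one did want the two norms to follow the same convention, here is a hedged sketch of the change the question hints at; the variable names come from the issue, while d_model and the surrounding constructor context are assumptions for illustration:

```python
import torch.nn as nn

d_model = 512
normalize_before = False  # post-norm layers, as in the configuration discussed above

# encoder_norm is only created for pre-norm layers; the suggestion is to build
# decoder_norm the same way instead of creating it unconditionally.
encoder_norm = nn.LayerNorm(d_model) if normalize_before else None
decoder_norm = nn.LayerNorm(d_model) if normalize_before else None
```

Whether that change is worth making is a separate question; per the comment above, the extra normalization is redundant but not harmful.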