Question about residual connection
Hi, thank you so much for your work!
I have one question about the self-attention implementation. In the paper Attention Is All You Need, the residual connection is made over the input embeddings + positional encoding, as shown in the architecture figure from the paper (first figure below).
However, in the code it looks to me as if the residual connection is made over the input embeddings only (the `src`); see the second figure below. Is this a mistake, or is there a reason for this modification? Thank you!
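For concreteness, a minimal sketch of the two residual patterns being contrasted (PyTorch-style, illustrative only; the function names and shapes are assumptions, not the repository's actual code):

```python
# Illustrative sketch only, not the repository's code. It contrasts where the
# positional encoding enters the residual path in the two variants.
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=256, num_heads=8)

def vanilla_block(src, pos):
    # "Attention Is All You Need": pos is added to the embeddings up front,
    # so the residual connection carries embeddings + positional encoding.
    x = src + pos
    out, _ = attn(x, x, x)
    return x + out          # residual includes the positional encoding

def variant_block(src, pos):
    # Pattern in question: pos is added only to the attention query/key,
    # while the residual connection carries src alone.
    q = k = src + pos
    out, _ = attn(q, k, src)
    return src + out        # residual does not include the positional encoding

# Shapes are (seq_len, batch, embed_dim), as nn.MultiheadAttention expects.
src = torch.randn(10, 2, 256)
pos = torch.randn(10, 2, 256)
print(vanilla_block(src, pos).shape, variant_block(src, pos).shape)
```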
Issue Analytics
- Created 3 years ago
- Comments: 6 (2 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
@mli0603 See Section 4.2, 'Importance of positional encodings', in the paper for the architectural choices on where to pass positional encodings. There is also Table 3, where the second row corresponds to the vanilla Transformer that passes positional encodings once at the transformer input, the variant you are referring to (also used in the demo colab). As we explain in the text, passing the encodings directly in attention leads to a significant performance boost.
That paragraph now makes much more sense. There also seems to be another deviation from the original Transformer: you apply the positional encoding only to the key and query, but not to the value. I did not see in the paper whether that choice also improves performance.
Even in the original paper, the value output from the encoder does not get the positional embedding treatment, so it makes sense to skip it for all the values. You add the position (to k, q) at every layer, so it also makes sense not to add it to the value in any of them.
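A hedged sketch of the scheme described in these comments (the layer structure and names are assumptions for illustration, not the actual implementation): the positional encoding is re-injected into the query and key at every layer, the value never receives it, and the residual path carries `src` alone.

```python
# Illustrative sketch (not the repository's code): pos goes to q and k at
# every layer, never to the value, and the residual is over src only.
import torch
import torch.nn as nn

class EncoderLayerSketch(nn.Module):
    def __init__(self, d_model=256, nhead=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, src, pos):
        q = k = src + pos                     # positional encoding on q, k only
        out, _ = self.attn(q, k, value=src)   # value stays position-free
        return self.norm(src + out)           # residual over src, not src + pos

layers = nn.ModuleList(EncoderLayerSketch() for _ in range(6))
src = torch.randn(10, 2, 256)   # (seq_len, batch, d_model)
pos = torch.randn(10, 2, 256)   # the same pos is passed to every layer
for layer in layers:
    src = layer(src, pos)       # "passing encodings in attention directly"
print(src.shape)
```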