Question about residual connection

See original GitHub issue

Hi, thank you so much for your work!

I have one question about the self-attention implementation. In the paper "Attention Is All You Need", the residual connection is made over the input embeddings + positional encoding, as shown in the figure below. [figure]

The figure in the paper seems to match the above, as shown below. [figure]

However, in the code, it looks to me like the residual connection is made over the input embeddings only (the src); see the figure below. Is this a mistake, or is there a reason for this modification? Thank you! [figure]
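For reference, a minimal PyTorch sketch of the vanilla residual connection being asked about: the layer input already contains the positional encoding, so the skip path carries it too. Class and variable names here are illustrative, not the repository's actual code.

```python
import torch.nn as nn

class VanillaEncoderLayer(nn.Module):
    """Residual connection as in 'Attention Is All You Need':
    the layer input x already equals embeddings + positional encoding
    (added once, before the first layer), so the skip path includes pos."""

    def __init__(self, d_model=512, nhead=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.self_attn(x, x, x)
        return self.norm(x + attn_out)  # residual over (embeddings + pos)
```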

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

3 reactions
szagoruyko commented, Jun 8, 2020

@mli0603 see section 4.2, 'Importance of positional encodings', in the paper for the architectural choices on where to pass positional encodings. There is also Table 3, in which the second row corresponds to the vanilla Transformer, where we pass positional encodings once at the transformer input; that is the variant you are referring to (it is also used in the demo colab). As we explain in the text, passing the encodings directly into the attention leads to a significant performance boost.
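To make that concrete, here is a rough sketch of the encoder-layer pattern described above: the positional encoding is kept separate from src and added to the query and key inside every layer's attention, while the residual is taken over src alone. It is a simplification with illustrative names, not the exact repository code.

```python
import torch.nn as nn

class PerLayerPosEncoderLayer(nn.Module):
    """Positional encodings are passed into the attention of every layer
    (added to query and key), rather than being folded into the input once."""

    def __init__(self, d_model=256, nhead=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, src, pos):
        q = k = src + pos                        # pos enters query and key only
        attn_out, _ = self.self_attn(q, k, value=src)
        return self.norm(src + attn_out)         # residual over src, without pos
```

This is also why the residual in the code appears to be over src only: the positional encoding is never mixed into the tensor that flows along the skip path.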

0 reactions
dashesy commented, Dec 30, 2020

That paragraph now makes much more sense. There also seems to be another deviation from the original Transformer: the positional encoding is applied only to the key and query, but not to the value. I did not see in the paper whether that choice also improves performance.

Even in the original paper, the value output from the encoder does not get the positional-embedding treatment, so it makes sense to avoid it for all the values. You add the position (to k, q) in all layers, so it makes sense not to add it to the value in any of them.

Read more comments on GitHub >

Top Results From Across the Web

  • What is Residual Connection? - Towards Data Science
    Residual connection provides another path for data to reach latter parts of the neural network by skipping some layers. Consider a sequence of …
  • What are "residual connections" in RNNs? - Cross Validated
    Residual connections are the same thing as 'skip connections'. They are used to allow gradients to flow through a network directly, …
  • Residual Connection Explained - Papers With Code
    Residual Connections are a type of skip-connection that learn residual functions with reference to the layer inputs, instead of learning unreferenced functions …
  • Question Answering with Self-Attention and Residuals
    … to train and scale up. We propose a new question answering architecture that combines RNNs with self-attention and residual connections to speed up …
  • Understanding and implementation of Residual Networks …
    [Link to the research paper] and Convolutional Neural Network course by Andrew Ng. Table of Contents: Introduction — The problem of very deep …
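For completeness, the residual (skip) connection described in the results above reduces to y = x + f(x). A generic PyTorch illustration, not taken from any of the linked articles:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = x + f(x): the skip path lets activations and gradients
    bypass the inner layers."""

    def __init__(self, d_model=256):
        super().__init__()
        self.inner = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, x):
        return x + self.inner(x)  # skip connection
```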
