Question about residual connection
Hi, thank you so much for your work!
I have one question about the self-attention implementation. In the paper Attention Is All You Need, the residual connection is made over the input embeddings + positional encoding, as shown in the architecture figure from the paper (first figure below).
However, in the code it looks to me as if the residual connection is made over the input embeddings only (the `src`); see the second figure below. Is this a mistake, or is there a reason for this modification? Thank you!
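For concreteness, a minimal sketch of the two residual patterns being contrasted (PyTorch-style, illustrative only; the function names and shapes are assumptions, not the repository's actual code):

```python
# Illustrative sketch only, not the repository's code. It contrasts where the
# positional encoding enters the residual path in the two variants.
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=256, num_heads=8)

def vanilla_block(src, pos):
    # "Attention Is All You Need": pos is added to the embeddings up front,
    # so the residual connection carries embeddings + positional encoding.
    x = src + pos
    out, _ = attn(x, x, x)
    return x + out          # residual includes the positional encoding

def variant_block(src, pos):
    # Pattern in question: pos is added only to the attention query/key,
    # while the residual connection carries src alone.
    q = k = src + pos
    out, _ = attn(q, k, src)
    return src + out        # residual does not include the positional encoding

# Shapes are (seq_len, batch, embed_dim), as nn.MultiheadAttention expects.
src = torch.randn(10, 2, 256)
pos = torch.randn(10, 2, 256)
print(vanilla_block(src, pos).shape, variant_block(src, pos).shape)
```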
Issue Analytics
- Created 3 years ago
- Comments: 6 (2 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
@mli0603 See Section 4.2, 'Importance of positional encodings', in the paper for the architectural choices on where to pass positional encodings. There is also Table 3, where the second row corresponds to the vanilla Transformer that passes positional encodings once at the transformer input, the variant you are referring to (also used in the demo colab). As we explain in the text, passing the encodings directly in attention leads to a significant performance boost.
That paragraph now makes much more sense. There also seems to be another deviation from the original Transformer: you apply the positional encoding only to the key and query, but not to the value. I did not see in the paper whether that choice also improves performance.
Even in the original paper, the value output from the encoder does not get the positional embedding treatment, so it makes sense to skip it for all the values. You add the position (to k, q) at every layer, so it also makes sense not to add it to the value in any of them.
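A hedged sketch of the scheme described in these comments (the layer structure and names are assumptions for illustration, not the actual implementation): the positional encoding is re-injected into the query and key at every layer, the value never receives it, and the residual path carries `src` alone.

```python
# Illustrative sketch (not the repository's code): pos goes to q and k at
# every layer, never to the value, and the residual is over src only.
import torch
import torch.nn as nn

class EncoderLayerSketch(nn.Module):
    def __init__(self, d_model=256, nhead=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, src, pos):
        q = k = src + pos                     # positional encoding on q, k only
        out, _ = self.attn(q, k, value=src)   # value stays position-free
        return self.norm(src + out)           # residual over src, not src + pos

layers = nn.ModuleList(EncoderLayerSketch() for _ in range(6))
src = torch.randn(10, 2, 256)   # (seq_len, batch, d_model)
pos = torch.randn(10, 2, 256)   # the same pos is passed to every layer
for layer in layers:
    src = layer(src, pos)       # "passing encodings in attention directly"
print(src.shape)
```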