What's the consideration of not applying positional encoding to V in the self-attention layer?

Question: What's the consideration of not applying positional encoding to V in the self-attention layer?
from typing import Optional

from torch import Tensor

def forward_post(self,
                 src,
                 src_mask: Optional[Tensor] = None,
                 src_key_padding_mask: Optional[Tensor] = None,
                 pos: Optional[Tensor] = None):
    # The positional encoding is added to the queries and keys only.
    q = k = self.with_pos_embed(src, pos)
    # The values are the raw features, without positional encoding.
    src2 = self.self_attn(q, k, value=src, attn_mask=src_mask,
                          key_padding_mask=src_key_padding_mask)[0]
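The with_pos_embed helper is not shown in the excerpt; assuming it follows the usual pattern of simply adding the positional embedding element-wise to the features, a minimal sketch would be:

def with_pos_embed(self, tensor, pos: Optional[Tensor]):
    # Assumed behavior: element-wise addition of the positional embedding,
    # falling back to the identity when no positional encoding is given.
    return tensor if pos is None else tensor + pos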
Top Results From Across the Web
- Self-Attention and Positional Encoding - mxnet - D2L Discussion
- Relative Positional Encoding - Jake Tae
- Rethinking Positional Encoding in Language Pre-training (TUPE)
- Attention Mechanism, Transformers, BERT, and GPT - OSF
- Relative Positional Encoding for Transformers with Linear Attention
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Hi, we follow the standard practice of adding positional encoding to the queries and keys only (see Transformer-XL or Stand-Alone Self-Attention in Vision Models), except that in our case it is absolute rather than relative.
+1. I have read the code and found that although the positional encoding is added to q and k when computing self/cross-attention, the output features are obtained only from v, which carries appearance features without any positional encoding. So I don't understand why the final output slot contains the spatial information that lets the FFN predict a bounding box from it. Can anyone explain where this spatial information comes from?
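For reference, here is a minimal single-head sketch (illustrative only, not the DETR implementation; the function and parameter names below are hypothetical) of the pattern discussed in the comments above: the positional encoding enters only the query/key projections, the values stay position-free, yet the attention weights, and therefore the output mixture, still depend on the positions.

import math

import torch.nn.functional as F

def single_head_attn_pos_on_qk(src, pos, w_q, w_k, w_v):
    # src: (seq_len, d_model) appearance features
    # pos: (seq_len, d_model) positional encoding
    # w_q, w_k, w_v: (d_model, d_head) projection weights
    q = (src + pos) @ w_q                       # queries see the positional encoding
    k = (src + pos) @ w_k                       # keys see the positional encoding
    v = src @ w_v                               # values are position-free
    scores = (q @ k.T) / math.sqrt(q.size(-1))  # scores depend on pos via q and k
    weights = F.softmax(scores, dim=-1)
    return weights @ v                          # position-dependent mixture of position-free values

Permuting pos while keeping src fixed changes the output, which is one way to see how spatial information can reach the output features even though v itself carries none: it enters through the attention weights rather than through the values.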