Will it make any sense to use a zero v in the first decoder layer?
See original GitHub issue

As in your code, the tgt of the decoder layer is initially assigned zeros, and these zeros are then used as v to compute a new output via the QKV attention operation. Take the pre-norm forward path as an example:
```python
def forward_pre(self, tgt, memory,
                tgt_mask: Optional[Tensor] = None,
                memory_mask: Optional[Tensor] = None,
                tgt_key_padding_mask: Optional[Tensor] = None,
                memory_key_padding_mask: Optional[Tensor] = None,
                pos: Optional[Tensor] = None,
                query_pos: Optional[Tensor] = None):
    tgt2 = self.norm1(tgt)
    q = k = self.with_pos_embed(tgt2, query_pos)
    tgt2 = self.self_attn(q, k, value=tgt2, attn_mask=tgt_mask,
                          key_padding_mask=tgt_key_padding_mask)[0]
    tgt = tgt + self.dropout1(tgt2)
    tgt2 = self.norm2(tgt)
    tgt2 = self.multihead_attn(query=self.with_pos_embed(tgt2, query_pos),
                               key=self.with_pos_embed(memory, pos),
                               value=memory, attn_mask=memory_mask,
                               key_padding_mask=memory_key_padding_mask)[0]
    tgt = tgt + self.dropout2(tgt2)
    tgt2 = self.norm3(tgt)
    tgt2 = self.linear2(self.dropout(self.activation(self.linear1(tgt2))))
    tgt = tgt + self.dropout3(tgt2)
    return tgt
```
I mean, if this is the first decoder layer, tgt is token-wise zero, so tgt2 is token-wise identical after the first LayerNorm. How does it make sense to compute an attention-weighted output from this tgt2? No matter what q and k are, I think nothing but a featureless bias will be learned.
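To make that concrete, here is a minimal sketch (toy sizes and a stock nn.MultiheadAttention standing in for the repo's module, with the zero-initialised biases randomised to mimic a trained layer) showing that with an all-zero tgt, every query position receives exactly the same self-attention output, regardless of what the attention weights are:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, nhead, num_queries = 256, 8, 100

norm1 = nn.LayerNorm(d_model)
self_attn = nn.MultiheadAttention(d_model, nhead)

# PyTorch initialises LayerNorm's bias and MultiheadAttention's in_proj_bias
# to zero, so fill them with random values to mimic a trained layer.
nn.init.normal_(norm1.bias)
nn.init.normal_(self_attn.in_proj_bias, std=0.02)

tgt = torch.zeros(num_queries, 1, d_model)         # all-zero tgt of the first layer
query_pos = torch.randn(num_queries, 1, d_model)   # stand-in for the learned queries

tgt2 = norm1(tgt)                  # every row equals LayerNorm's bias vector
q = k = tgt2 + query_pos           # queries/keys do differ per position
out, _ = self_attn(q, k, value=tgt2)

# The value rows are identical, so any attention-weighted average of them is
# that same row: every query position gets the same vector.
print(torch.allclose(out[0], out[50], atol=1e-5))  # True
print((out[0].abs().sum() > 0).item())             # True: non-zero thanks to biases
```

So whatever the attention weights do, the first self-attention can only add a learned, query-independent offset.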
Top Results From Across the Web
- Transformer-based Encoder-Decoder Models: To generate the first target vector, the decoder is fed the BOS vector, illustrated as y0 in the design above. The target vector...
- Transformers Explained Visually (Part 2): How it works, ...: The Transformer has two Embedding layers. The input sequence is fed to the first Embedding layer, known as the Input Embedding.
- Which activation function for output layer? - Cross Validated: First of all: the activation function g(x) at the output layer often depends on your cost function. This is done to make the...
- Why multi-head self attention works: math, intuitions and 10 ...: Learn everything there is to know about the attention mechanisms of the... The same principles apply in the encoder-decoder attention or...
- Seq2seq and Attention - Lena Voita: The simplest encoder-decoder model consists of two RNNs (LSTMs): one for the encoder and another for the decoder. Encoder RNN reads the source...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Your understanding is correct. The first decoder self-attention does not receive any data-dependent inputs, so we pass zeros as inputs. It could be removed to save some parameters and compute, but we keep it for simplicity.
We should add a comment explaining this in the code.
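For context, the call pattern being described looks roughly like the sketch below. This is illustrative only: a stock nn.TransformerDecoder stands in for the repo's decoder, the pos/query_pos arguments of the DETR-style layers are omitted, and all sizes are made up.

```python
import torch
import torch.nn as nn

d_model, nhead, num_queries, batch = 256, 8, 100, 2

query_embed = nn.Embedding(num_queries, d_model)      # learned object queries
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead), num_layers=6)

memory = torch.randn(100, batch, d_model)             # stand-in encoder output (HW, N, C)
query_pos = query_embed.weight.unsqueeze(1).repeat(1, batch, 1)

# The data-independent starting point the answer refers to:
# nothing but zeros goes into the decoder as tgt.
tgt = torch.zeros_like(query_pos)
hs = decoder(tgt, memory)
print(hs.shape)                                       # torch.Size([100, 2, 256])
```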
@rardz @szagoruyko
tgt2 = self.self_attn(q, k, value=tgt2, attn_mask=tgt_mask, key_padding_mask=tgt_key_padding_mask)[0]
https://pytorch.org/docs/stable/_modules/torch/nn/modules/activation.html#MultiheadAttention
Here self_attn applies an input projection with a learnable bias to q, k, and v. So even though tgt2 is an all-zero tensor at the first layer, the projected value it produces can still hold non-zero values.
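A small check of that claim (a sketch against nn.MultiheadAttention's fused in_proj parameters; note that PyTorch initialises in_proj_bias to zero, so here it is filled with random values to stand in for a trained layer):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
embed_dim = 256
attn = nn.MultiheadAttention(embed_dim, num_heads=8)  # bias=True by default

# PyTorch zero-initialises in_proj_bias; randomise it to mimic a trained layer.
nn.init.normal_(attn.in_proj_bias, std=0.02)

# The fused input projection maps x to [q; k; v] = x @ W^T + b.
# With x = 0, the value branch is exactly the last third of the bias vector.
x = torch.zeros(1, embed_dim)
qkv = F.linear(x, attn.in_proj_weight, attn.in_proj_bias)
v = qkv[:, 2 * embed_dim:]

print(torch.allclose(v, attn.in_proj_bias[2 * embed_dim:].unsqueeze(0)))  # True
print((v.abs().sum() > 0).item())                                         # True
```

In other words, with a zero value tensor the value projection collapses to its bias term, which is exactly the "featureless bias" the question anticipates, and it is shared by every query.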