
Will it make any sense to use zero v in the first decoder layer?


In your code, the decoder layer's tgt is initially assigned a zero tensor, and these zeros are used as v in the QKV self-attention operation. Take the pre-norm forward pass as an example:

    def forward_pre(self, tgt, memory,
                    tgt_mask: Optional[Tensor] = None,
                    memory_mask: Optional[Tensor] = None,
                    tgt_key_padding_mask: Optional[Tensor] = None,
                    memory_key_padding_mask: Optional[Tensor] = None,
                    pos: Optional[Tensor] = None,
                    query_pos: Optional[Tensor] = None):
        # Self-attention: q and k carry the query positional embedding,
        # but the value is the (normalized) tgt itself.
        tgt2 = self.norm1(tgt)
        q = k = self.with_pos_embed(tgt2, query_pos)
        tgt2 = self.self_attn(q, k, value=tgt2, attn_mask=tgt_mask,
                              key_padding_mask=tgt_key_padding_mask)[0]
        tgt = tgt + self.dropout1(tgt2)
        # Cross-attention over the encoder memory.
        tgt2 = self.norm2(tgt)
        tgt2 = self.multihead_attn(query=self.with_pos_embed(tgt2, query_pos),
                                   key=self.with_pos_embed(memory, pos),
                                   value=memory, attn_mask=memory_mask,
                                   key_padding_mask=memory_key_padding_mask)[0]
        tgt = tgt + self.dropout2(tgt2)
        # Feed-forward block.
        tgt2 = self.norm3(tgt)
        tgt2 = self.linear2(self.dropout(self.activation(self.linear1(tgt2))))
        tgt = tgt + self.dropout3(tgt2)
        return tgt

I mean, if this is the first decoder layer, tgt is zero for every token, so tgt2 is identical across tokens after the first LayerNorm. How does it make sense to compute a weighted output from this tgt2? No matter what q and k are, I think nothing but a featureless bias will be learned.
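The observation above can be checked directly: LayerNorm of an all-zero tensor collapses every token to the same vector (the LayerNorm bias), so any attention-weighted average over those values returns the same vector regardless of the attention weights. A minimal sketch (standalone, not DETR code):

```python
import torch
import torch.nn as nn

d_model, num_tokens = 8, 4

# tgt starts as zeros, as in the first decoder layer
tgt = torch.zeros(num_tokens, d_model)

norm = nn.LayerNorm(d_model)
tgt2 = norm(tgt)  # mean 0, variance 0: every token reduces to the bias term

# All rows are identical, so an attention-weighted sum of them is
# the same vector for any set of attention weights.
assert torch.allclose(tgt2, tgt2[0].expand_as(tgt2))
```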

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

9 reactions
szagoruyko commented, Jun 2, 2020

Your understanding is correct. The first decoder self-attention does not receive any data-dependent inputs, so we pass zeros as inputs. It could be removed to save some parameters and compute, but we keep it for simplicity.

We should add a comment explaining this in the code.

0 reactions
zachluo commented, Dec 9, 2022

@rardz @szagoruyko

    tgt2 = self.self_attn(q, k, value=tgt2, attn_mask=tgt_mask,
                          key_padding_mask=tgt_key_padding_mask)[0]

https://pytorch.org/docs/stable/_modules/torch/nn/modules/activation.html#MultiheadAttention

Here self_attn applies a projection layer with a learnable bias to q, k, and v. So even though tgt2 is a zero tensor at the first layer, the projected values can still hold non-zero values.
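This point can be sketched with PyTorch's nn.MultiheadAttention. Note that its in-projection bias is initialized to zero, so at initialization the output for a zero value tensor is still zero; the sketch below fills the bias with a non-zero constant to stand in for a trained model:

```python
import torch
import torch.nn as nn

d_model, nhead, num_tokens = 8, 2, 4

attn = nn.MultiheadAttention(d_model, nhead)  # bias=True by default
tgt2 = torch.zeros(num_tokens, 1, d_model)    # (seq, batch, embed): zero values
q = k = torch.randn(num_tokens, 1, d_model)   # stand-in for tgt2 + query_pos

with torch.no_grad():
    # Pretend training has pushed the learnable in-projection bias
    # away from its zero initialization.
    attn.in_proj_bias.fill_(0.1)

out, _ = attn(q, k, value=tgt2)

# The value projection is 0 @ W_v + b_v = b_v, which is non-zero,
# so the attention output is non-zero despite the zero value input.
assert out.abs().sum() > 0
```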
