Will it make any sense to use a zero v in the first decoder layer?
See original GitHub issue

As in your code, the tgt of the decoder layer is initially assigned zeros, and these zeros are then used as v to compute a new output via the QKV attention operation. Take the pre-norm forward path as an example:
```python
def forward_pre(self, tgt, memory,
                tgt_mask: Optional[Tensor] = None,
                memory_mask: Optional[Tensor] = None,
                tgt_key_padding_mask: Optional[Tensor] = None,
                memory_key_padding_mask: Optional[Tensor] = None,
                pos: Optional[Tensor] = None,
                query_pos: Optional[Tensor] = None):
    tgt2 = self.norm1(tgt)
    q = k = self.with_pos_embed(tgt2, query_pos)
    tgt2 = self.self_attn(q, k, value=tgt2, attn_mask=tgt_mask,
                          key_padding_mask=tgt_key_padding_mask)[0]
    tgt = tgt + self.dropout1(tgt2)
    tgt2 = self.norm2(tgt)
    tgt2 = self.multihead_attn(query=self.with_pos_embed(tgt2, query_pos),
                               key=self.with_pos_embed(memory, pos),
                               value=memory, attn_mask=memory_mask,
                               key_padding_mask=memory_key_padding_mask)[0]
    tgt = tgt + self.dropout2(tgt2)
    tgt2 = self.norm3(tgt)
    tgt2 = self.linear2(self.dropout(self.activation(self.linear1(tgt2))))
    tgt = tgt + self.dropout3(tgt2)
    return tgt
```
I mean, if this is the first decoder layer, tgt is token-wise zero, so tgt2 is token-wise identical after the first LayerNorm. How does it make sense to compute an attention-weighted output from this tgt2? No matter what q and k are, I think nothing but a featureless bias will be learned.
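To make that concrete, here is a minimal sketch (toy sizes and a stock nn.MultiheadAttention standing in for the repo's module, with the zero-initialised biases randomised to mimic a trained layer) showing that with an all-zero tgt, every query position receives exactly the same self-attention output, regardless of what the attention weights are:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, nhead, num_queries = 256, 8, 100

norm1 = nn.LayerNorm(d_model)
self_attn = nn.MultiheadAttention(d_model, nhead)

# PyTorch initialises LayerNorm's bias and MultiheadAttention's in_proj_bias
# to zero, so fill them with random values to mimic a trained layer.
nn.init.normal_(norm1.bias)
nn.init.normal_(self_attn.in_proj_bias, std=0.02)

tgt = torch.zeros(num_queries, 1, d_model)         # all-zero tgt of the first layer
query_pos = torch.randn(num_queries, 1, d_model)   # stand-in for the learned queries

tgt2 = norm1(tgt)                  # every row equals LayerNorm's bias vector
q = k = tgt2 + query_pos           # queries/keys do differ per position
out, _ = self_attn(q, k, value=tgt2)

# The value rows are identical, so any attention-weighted average of them is
# that same row: every query position gets the same vector.
print(torch.allclose(out[0], out[50], atol=1e-5))  # True
print((out[0].abs().sum() > 0).item())             # True: non-zero thanks to biases
```

So whatever the attention weights do, the first self-attention can only add a learned, query-independent offset.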
Top Results From Across the Web
- Transformer-based Encoder-Decoder Models: To generate the first target vector, the decoder is fed the BOS vector, illustrated as y0 in the design above. The target vector...
- Transformers Explained Visually (Part 2): How it works, ...: The Transformer has two Embedding layers. The input sequence is fed to the first Embedding layer, known as the Input Embedding.
- Which activation function for output layer? - Cross Validated: First of all: the activation function g(x) at the output layer often depends on your cost function. This is done to make the...
- Why multi-head self attention works: math, intuitions and 10 ...: Learn everything there is to know about the attention mechanisms of the... The same principles apply in the encoder-decoder attention or...
- Seq2seq and Attention - Lena Voita: The simplest encoder-decoder model consists of two RNNs (LSTMs): one for the encoder and another for the decoder. Encoder RNN reads the source...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Your understanding is correct. The first decoder self-attention does not receive any data-dependent inputs, so we pass zeros as inputs. It could be removed to save some parameters and compute, but we keep it for simplicity.
We should add a comment explaining this in the code.
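For context, the call pattern being described looks roughly like the sketch below. This is illustrative only: a stock nn.TransformerDecoder stands in for the repo's decoder, the pos/query_pos arguments of the DETR-style layers are omitted, and all sizes are made up.

```python
import torch
import torch.nn as nn

d_model, nhead, num_queries, batch = 256, 8, 100, 2

query_embed = nn.Embedding(num_queries, d_model)      # learned object queries
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead), num_layers=6)

memory = torch.randn(100, batch, d_model)             # stand-in encoder output (HW, N, C)
query_pos = query_embed.weight.unsqueeze(1).repeat(1, batch, 1)

# The data-independent starting point the answer refers to:
# nothing but zeros goes into the decoder as tgt.
tgt = torch.zeros_like(query_pos)
hs = decoder(tgt, memory)
print(hs.shape)                                       # torch.Size([100, 2, 256])
```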
@rardz @szagoruyko
tgt2 = self.self_attn(q, k, value=tgt2, attn_mask=tgt_mask, key_padding_mask=tgt_key_padding_mask)[0]
https://pytorch.org/docs/stable/_modules/torch/nn/modules/activation.html#MultiheadAttention
Here self_attn applies an input projection with a learnable bias to q, k, and v. So even though tgt2 is an all-zero tensor at the first layer, the projected value it produces can still hold non-zero values.
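A small check of that claim (a sketch against nn.MultiheadAttention's fused in_proj parameters; note that PyTorch initialises in_proj_bias to zero, so here it is filled with random values to stand in for a trained layer):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
embed_dim = 256
attn = nn.MultiheadAttention(embed_dim, num_heads=8)  # bias=True by default

# PyTorch zero-initialises in_proj_bias; randomise it to mimic a trained layer.
nn.init.normal_(attn.in_proj_bias, std=0.02)

# The fused input projection maps x to [q; k; v] = x @ W^T + b.
# With x = 0, the value branch is exactly the last third of the bias vector.
x = torch.zeros(1, embed_dim)
qkv = F.linear(x, attn.in_proj_weight, attn.in_proj_bias)
v = qkv[:, 2 * embed_dim:]

print(torch.allclose(v, attn.in_proj_bias[2 * embed_dim:].unsqueeze(0)))  # True
print((v.abs().sum() > 0).item())                                         # True
```

In other words, with a zero value tensor the value projection collapses to its bias term, which is exactly the "featureless bias" the question anticipates, and it is shared by every query.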