Why are the outputs from the decoder layers concatenated, while this is not the case for the encoder?
See original GitHub issue

Hi,

I have a question regarding the code for the transformer encoder and decoder. I am looking at the case args.deformable=False and args.Tracking=False. Reading transformer.py I am a bit confused by the code. I checked self.num_layers and it is 6 for both the encoder and the decoder. However, in the decoder forward the output of each layer is stored, and the stored outputs are then stacked to form the final output:
```python
for i, layer in enumerate(self.layers):
    if self.track_attention:
        track_output = output[:-100].clone()

        track_output = self.layers_track_attention[i](
            track_output,
            src_mask=tgt_mask,
            src_key_padding_mask=tgt_key_padding_mask,
            pos=track_query_pos)

        output = torch.cat([track_output, output[-100:]])

    output = layer(output, memory, tgt_mask=tgt_mask,
                   memory_mask=memory_mask,
                   tgt_key_padding_mask=tgt_key_padding_mask,
                   memory_key_padding_mask=memory_key_padding_mask,
                   pos=pos, query_pos=query_pos)
    if self.return_intermediate:
        intermediate.append(output)

if self.return_intermediate:
    output = torch.stack(intermediate)

if self.norm is not None:
    return self.norm(output), output
```
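For intuition, here is a minimal, self-contained sketch of the return_intermediate pattern using toy layers (the shapes and the stand-in nn.Linear layers are assumptions for illustration, not the actual code above); it shows that stacking adds a leading dimension of size num_layers to the decoder output:

```python
import torch
import torch.nn as nn

# Toy stand-ins for the six decoder layers (each just maps d_model -> d_model).
num_layers, num_queries, batch, d_model = 6, 100, 2, 256
layers = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(num_layers)])

output = torch.randn(num_queries, batch, d_model)  # object/track queries
intermediate = []

for layer in layers:
    output = layer(output)       # one "decoder layer" update
    intermediate.append(output)  # keep this layer's result

stacked = torch.stack(intermediate)
print(stacked.shape)  # torch.Size([6, 100, 2, 256]) -> [num_layers, num_queries, batch, d_model]
```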
For the encoder, however, the loop also runs over all six layers, but the returned output is only the result of the last layer:
```python
for layer in self.layers:
    output = layer(output, src_mask=mask,
                   src_key_padding_mask=src_key_padding_mask, pos=pos)
    print(output.shape, 'enc-layer')

if self.norm is not None:
    output = self.norm(output)
```
Is this a bug, or am I misunderstanding something?
Top GitHub Comments
Please read the paper and its related work to fully understand how our method works.
Does that mean the transformer has losses besides the ones that are introduced in the paper?
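For context, the usual reason DETR-style detection transformers return the stacked intermediate decoder outputs is deep supervision: the same prediction heads and the set-based matching loss are applied to every decoder layer's output, not only the last one, which is also why nothing comparable is needed on the encoder side. Below is a rough sketch of that pattern; the head names class_embed and bbox_embed follow DETR conventions and are assumptions here, not code taken from this repository.

```python
import torch
import torch.nn as nn

num_layers, num_queries, batch, d_model, num_classes = 6, 100, 2, 256, 91

# Hypothetical shared prediction heads (DETR-style naming, assumed here).
class_embed = nn.Linear(d_model, num_classes + 1)
bbox_embed = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                           nn.Linear(d_model, 4))

# Stacked decoder output as returned when return_intermediate is enabled:
# [num_layers, num_queries, batch, d_model]
hs = torch.randn(num_layers, num_queries, batch, d_model)

# The heads are applied to every layer's output ...
outputs_class = class_embed(hs)           # [6, 100, 2, num_classes + 1]
outputs_coord = bbox_embed(hs).sigmoid()  # [6, 100, 2, 4]

# ... so a loss can be computed per decoder layer ("auxiliary decoding losses"):
# index -1 gives the final predictions, the rest supervise the earlier layers.
final_logits, aux_logits = outputs_class[-1], outputs_class[:-1]
print(final_logits.shape, aux_logits.shape)
```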