Why do the encoder and decoder use "non_pad_mask"?
I think the non_pad_mask is not necessary, because the handling of padding is already done by attn_mask. Why is it necessary?
I think it’s because “get_attn_key_pad_mask” doesn’t mask all the necessary positions, so “get_non_pad_mask” takes care of the parts that are left out.
Example: take the sentence [I, love, Github, PAD, PAD]. After “padding_mask = seq_k.eq(Constants.PAD)”, this sentence turns into [0 0 0 1 1]. After “padding_mask = padding_mask.unsqueeze(1).expand(-1, len_q, -1)”, it becomes (mask1):
0 0 0 1 1
0 0 0 1 1
0 0 0 1 1
0 0 0 1 1
0 0 0 1 1
However, the mask should actually be (mask2):
0 0 0 1 1
0 0 0 1 1
0 0 0 1 1
1 1 1 1 1
1 1 1 1 1
The last two rows of mask2 are the rows whose query position is itself PAD, so they should be fully masked as well.
Mask1 is used in DotProductAttention: we get the enc_output from “enc_output, enc_slf_attn = self.slf_attn(enc_input, enc_input, enc_input, mask=slf_attn_mask)”. However, mask2 is what should be used.
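Here is a minimal, self-contained sketch of the difference, assuming Constants.PAD == 0 and using made-up token ids for the toy sentence above (the names mask1, mask2 and query_mask are illustrative, not taken from the repository):

```python
import torch

PAD = 0  # assuming Constants.PAD == 0

# Toy batch with one sentence: [I, love, Github, PAD, PAD] -> made-up token ids
seq = torch.tensor([[5, 6, 7, PAD, PAD]])   # shape: (batch, len); keys == queries for self-attention
len_q = seq.size(1)

# Key-side padding mask, as quoted above (this is "mask1"):
# every query row masks the same PAD key columns.
padding_mask = seq.eq(PAD)                               # (batch, len_k) -> [0 0 0 1 1]
mask1 = padding_mask.unsqueeze(1).expand(-1, len_q, -1)  # (batch, len_q, len_k)

# "mask2" would additionally mask the rows whose *query* token is PAD.
query_mask = seq.eq(PAD).unsqueeze(-1).expand(-1, -1, len_q)  # (batch, len_q, len_k)
mask2 = mask1 | query_mask

print(mask1[0].int())  # columns 3 and 4 are masked in every row
print(mask2[0].int())  # rows 3 and 4 are additionally masked completely
```

mask1 only blanks out the PAD key columns; mask2 additionally blanks out the rows for PAD query positions, which is exactly what the attention mask alone does not do.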
Then, let’s look at “get_non_pad_mask”: for the same sentence, “seq.ne(Constants.PAD).type(torch.float).unsqueeze(-1)” gives 1 1 1 0 0 (as a column of shape (len, 1)).
So this is what “enc_output *= non_pad_mask” does: after this elementwise (broadcast) multiplication, the positions that mask1 left unmasked are finally zeroed out, and the output effectively returns to the sentence’s true length before padding.
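A small sketch of that second step, again assuming Constants.PAD == 0; the encoder output here is random just to show the broadcasting (in the real model it would come out of self-attention and the position-wise feed-forward layer):

```python
import torch

PAD = 0  # assuming Constants.PAD == 0

seq = torch.tensor([[5, 6, 7, PAD, PAD]])   # same toy sentence
d_model = 4                                 # illustrative hidden size

# non_pad_mask as quoted above: 1.0 for real tokens, 0.0 for PAD
non_pad_mask = seq.ne(PAD).type(torch.float).unsqueeze(-1)  # (batch, len, 1)

# Stand-in for the layer output; PAD rows are generally non-zero at this point.
enc_output = torch.randn(1, seq.size(1), d_model)           # (batch, len, d_model)

enc_output *= non_pad_mask   # broadcasting zeroes the rows at PAD positions
print(enc_output[0])         # the last two rows are now all zeros
```

Because non_pad_mask has shape (batch, len, 1), the multiplication broadcasts over the hidden dimension and zeroes entire rows, i.e. every feature of every PAD position.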
You’re right. I don’t think there would be any problems without non_pad_mask. Perhaps the author has a better answer. @jadore801120