Why do the encoder and decoder use "non_pad_mask"?
I think the non_pad_mask is not necessary, because the handling of padding is already done by attn_mask. Why is it necessary?
I think it’s because “get_attn_key_pad_mask” doesn’t mask all the necessary positions, so “get_non_pad_mask” takes care of the parts that are left out.
Example: take the sentence [I, love, Github, PAD, PAD]. After “padding_mask = seq_k.eq(Constants.PAD)”, this sentence turns into [0 0 0 1 1]. After “padding_mask = padding_mask.unsqueeze(1).expand(-1, len_q, -1)”, it becomes (mask1):
0 0 0 1 1
0 0 0 1 1
0 0 0 1 1
0 0 0 1 1
0 0 0 1 1
However, the mask should actually be (mask2):
0 0 0 1 1
0 0 0 1 1
0 0 0 1 1
1 1 1 1 1
1 1 1 1 1
The last two rows of mask2 are the rows whose query position is itself PAD, so they should be fully masked as well.
Mask1 is used in DotProductAttention: we get the enc_output from “enc_output, enc_slf_attn = self.slf_attn(enc_input, enc_input, enc_input, mask=slf_attn_mask)”. However, mask2 is what should be used.
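Here is a minimal, self-contained sketch of the difference, assuming Constants.PAD == 0 and using made-up token ids for the toy sentence above (the names mask1, mask2 and query_mask are illustrative, not taken from the repository):

```python
import torch

PAD = 0  # assuming Constants.PAD == 0

# Toy batch with one sentence: [I, love, Github, PAD, PAD] -> made-up token ids
seq = torch.tensor([[5, 6, 7, PAD, PAD]])   # shape: (batch, len); keys == queries for self-attention
len_q = seq.size(1)

# Key-side padding mask, as quoted above (this is "mask1"):
# every query row masks the same PAD key columns.
padding_mask = seq.eq(PAD)                               # (batch, len_k) -> [0 0 0 1 1]
mask1 = padding_mask.unsqueeze(1).expand(-1, len_q, -1)  # (batch, len_q, len_k)

# "mask2" would additionally mask the rows whose *query* token is PAD.
query_mask = seq.eq(PAD).unsqueeze(-1).expand(-1, -1, len_q)  # (batch, len_q, len_k)
mask2 = mask1 | query_mask

print(mask1[0].int())  # columns 3 and 4 are masked in every row
print(mask2[0].int())  # rows 3 and 4 are additionally masked completely
```

mask1 only blanks out the PAD key columns; mask2 additionally blanks out the rows for PAD query positions, which is exactly what the attention mask alone does not do.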
Then, let’s look at “get_non_pad_mask”: for the same sentence, “seq.ne(Constants.PAD).type(torch.float).unsqueeze(-1)” gives 1 1 1 0 0 (as a column of shape (len, 1)).
So this is what “enc_output *= non_pad_mask” does: after this elementwise (broadcast) multiplication, the positions that mask1 left unmasked are finally zeroed out, and the output effectively returns to the sentence’s true length before padding.
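A small sketch of that second step, again assuming Constants.PAD == 0; the encoder output here is random just to show the broadcasting (in the real model it would come out of self-attention and the position-wise feed-forward layer):

```python
import torch

PAD = 0  # assuming Constants.PAD == 0

seq = torch.tensor([[5, 6, 7, PAD, PAD]])   # same toy sentence
d_model = 4                                 # illustrative hidden size

# non_pad_mask as quoted above: 1.0 for real tokens, 0.0 for PAD
non_pad_mask = seq.ne(PAD).type(torch.float).unsqueeze(-1)  # (batch, len, 1)

# Stand-in for the layer output; PAD rows are generally non-zero at this point.
enc_output = torch.randn(1, seq.size(1), d_model)           # (batch, len, d_model)

enc_output *= non_pad_mask   # broadcasting zeroes the rows at PAD positions
print(enc_output[0])         # the last two rows are now all zeros
```

Because non_pad_mask has shape (batch, len, 1), the multiplication broadcasts over the hidden dimension and zeroes entire rows, i.e. every feature of every PAD position.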
You’re right. I don’t think there would be any problems without non_pad_mask. Perhaps the author has a better answer. @jadore801120