Clarifying attention mask
I don’t quite understand the attention mask as it’s implemented.
Here is the relevant line: https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling.py#L312 :
...
attention_scores = attention_scores / math.sqrt(self.attention_head_size)
# Apply the attention mask (precomputed for all layers in the BertModel forward() function)
attention_scores = attention_scores + attention_mask
# Normalize the attention scores to probabilities.
attention_probs = nn.Softmax(dim=-1)(attention_scores)
...
So it seems the proper way to use attention_mask is to set the positions you want to keep to 1’s and the positions you want to mask out to 0’s. Curious why we don’t simply multiply instead of add and then normalize? Is it for stability reasons?
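For concreteness, here is a minimal sketch of how such a 1/0 mask is usually built for a padded batch (the tensors and the pad token id of 0 are illustrative, not code from the repo):

import torch

# Two sequences padded to length 5 with a pad token id of 0; 1 = real token, 0 = padding
input_ids = torch.tensor([[101, 2023, 2003,  102,    0],
                          [101, 2748,  102,    0,    0]])
attention_mask = (input_ids != 0).long()
# tensor([[1, 1, 1, 1, 0],
#         [1, 1, 1, 0, 0]])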
The reason a classic binary attention mask won’t work here is that the Softmax activation includes an exponential, and so an input of 0 can still yield quite a large softmax weight (since e^0 = 1).
The mask can’t be applied after the softmax either, because then the resulting values would no longer sum to 1. So the best solution is to add (not multiply!) a large negative value to the indices you want to mask. That way they will be 0 or almost 0 after the softmax step, because as x becomes more negative, e^x gets closer and closer to 0.
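A quick numeric check of that point (a self-contained sketch, not code from the repo):

import torch
import torch.nn.functional as F

scores = torch.tensor([2.0, 1.0, 0.5, 3.0])   # raw attention scores; the last position is padding
keep   = torch.tensor([1.0, 1.0, 1.0, 0.0])   # 1 = keep, 0 = mask out

# Multiplying before the softmax does not remove the padded position:
# its score becomes 0, and e^0 = 1 still earns it a sizeable weight.
print(F.softmax(scores * keep, dim=-1))           # last entry is clearly non-zero

# Adding a large negative value (approximating -inf) drives its weight to ~0.
additive_mask = (1.0 - keep) * -10000.0
print(F.softmax(scores + additive_mask, dim=-1))  # last entry is ~0, the rest sum to ~1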
@xiangrongzeng Just a passerby here, but I believe this is the method where [0, 1] attention masks are mapped to the [-inf, 0] range: https://github.com/huggingface/transformers/blob/88a951e3cc00f56b94d9b93dbc35a3812cd88747/src/transformers/modeling_utils.py#L221-L281
…and the specific operation in question: https://github.com/huggingface/transformers/blob/88a951e3cc00f56b94d9b93dbc35a3812cd88747/src/transformers/modeling_utils.py#L274-L281
This method lives in ModuleUtilsMixin, which I’m assuming is inherited by downstream models.
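Roughly, the conversion in that method boils down to the following (a simplified sketch; the actual code also handles dtype and the decoder/causal case):

import torch

attention_mask = torch.tensor([[1, 1, 1, 0, 0]])   # 1 = attend, 0 = ignore

# Reshape to (batch, 1, 1, seq_len) so it broadcasts over heads and query positions
extended = attention_mask[:, None, None, :].float()

# Map 1 -> 0.0 (leave the score alone) and 0 -> -10000.0 (effectively -inf before the softmax)
extended = (1.0 - extended) * -10000.0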