Clarifying attention mask
I don’t quite understand the attention mask as it’s implemented.
Here is the relevant line: https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling.py#L312 :
...
attention_scores = attention_scores / math.sqrt(self.attention_head_size)
# Apply the attention mask (precomputed for all layers in the BertModel forward() function)
attention_scores = attention_scores + attention_mask
# Normalize the attention scores to probabilities.
attention_probs = nn.Softmax(dim=-1)(attention_scores)
...
So it seems the proper way to use attention_mask is to set the positions you want to keep to 1’s and the positions you want to mask out to 0’s. Curious why we don’t simply multiply instead of add and then normalize? Is it for stability reasons?
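For concreteness, here is a minimal sketch of how such a 1/0 mask is usually built for a padded batch (the tensors and the pad token id of 0 are illustrative, not code from the repo):

import torch

# Two sequences padded to length 5 with a pad token id of 0; 1 = real token, 0 = padding
input_ids = torch.tensor([[101, 2023, 2003,  102,    0],
                          [101, 2748,  102,    0,    0]])
attention_mask = (input_ids != 0).long()
# tensor([[1, 1, 1, 1, 0],
#         [1, 1, 1, 0, 0]])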
The reason a classic binary attention mask won’t work here is that the Softmax activation includes an exponential, and so an input of 0 can still yield quite a large softmax weight (since e^0 = 1).
The mask can’t be applied after the softmax either, because then the resulting values would no longer sum to 1. So the best solution is to add (not multiply!) a large negative value to the indices you want to mask. That way they will be 0 or almost 0 after the softmax step, because as x becomes more negative, e^x gets closer and closer to 0.
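A quick numeric check of that point (a self-contained sketch, not code from the repo):

import torch
import torch.nn.functional as F

scores = torch.tensor([2.0, 1.0, 0.5, 3.0])   # raw attention scores; the last position is padding
keep   = torch.tensor([1.0, 1.0, 1.0, 0.0])   # 1 = keep, 0 = mask out

# Multiplying before the softmax does not remove the padded position:
# its score becomes 0, and e^0 = 1 still earns it a sizeable weight.
print(F.softmax(scores * keep, dim=-1))           # last entry is clearly non-zero

# Adding a large negative value (approximating -inf) drives its weight to ~0.
additive_mask = (1.0 - keep) * -10000.0
print(F.softmax(scores + additive_mask, dim=-1))  # last entry is ~0, the rest sum to ~1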
@xiangrongzeng Just a passerby here, but I believe this is the method where [0, 1] attention masks are mapped to the [-inf, 0] range: https://github.com/huggingface/transformers/blob/88a951e3cc00f56b94d9b93dbc35a3812cd88747/src/transformers/modeling_utils.py#L221-L281
…and the specific operation in question: https://github.com/huggingface/transformers/blob/88a951e3cc00f56b94d9b93dbc35a3812cd88747/src/transformers/modeling_utils.py#L274-L281
This method lives in ModuleUtilsMixin, which I’m assuming is inherited by downstream models.
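Roughly, the conversion in that method boils down to the following (a simplified sketch; the actual code also handles dtype and the decoder/causal case):

import torch

attention_mask = torch.tensor([[1, 1, 1, 0, 0]])   # 1 = attend, 0 = ignore

# Reshape to (batch, 1, 1, seq_len) so it broadcasts over heads and query positions
extended = attention_mask[:, None, None, :].float()

# Map 1 -> 0.0 (leave the score alone) and 0 -> -10000.0 (effectively -inf before the softmax)
extended = (1.0 - extended) * -10000.0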