Mismatch between attention mask implementations in transformers and tokenizers
Environment info

- transformers version: 4.10.2
- Models: BERT and other Transformer models
- Library: tokenizers (@LysandreJik)
Problem and solution:
In the current implementation of transformers, the attention mask is added to the attention scores before the softmax, for instance here: https://github.com/huggingface/transformers/blob/3ab0185b061baae207efed02799dd424ee8377f1/src/transformers/models/bert/modeling_bert.py#L326
When this kind of additive mask is used before the softmax, the masking values should be -inf (or a very large negative number). However, the tokenizers of the transformers library output 0 for padding positions and 1 for all other positions in the attention mask: https://huggingface.co/transformers/glossary.html#attention-mask Adding such a 0/1 mask directly to the scores has an unpredictable effect on the final attention values after the softmax; depending on the magnitude of the attention scores, the mask may have almost no effect at all.
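To make the mismatch concrete, here is a small self-contained PyTorch sketch (toy numbers, not library code) comparing what happens when the raw 0/1 mask is added to the scores versus when masked positions are set to -inf:

```python
import torch

# Toy attention scores for one query over three keys; the last key is padding.
scores = torch.tensor([[4.0, 3.0, 5.0]])
# Tokenizer-style mask: 1 = real token, 0 = padding.
attention_mask = torch.tensor([[1.0, 1.0, 0.0]])

# Adding the raw 0/1 mask barely shifts the scores, so the padding position
# still receives a large share of the attention (~0.42 here).
print(torch.softmax(scores + attention_mask, dim=-1))

# Setting masked positions to -inf (or a very large negative number) before
# the softmax drives the padding position's weight to zero.
print(torch.softmax(scores.masked_fill(attention_mask == 0, float("-inf")), dim=-1))
```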
An alternative implementation is to apply the attention mask to the attention scores multiplicatively or as a boolean (logical_and-style) mask (PyTorch, TensorFlow, JAX). With that convention it is fine to assign the value 0 to padding positions in the attention mask.
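A minimal sketch of that convention (a toy scaled dot-product attention written for illustration, not the transformers implementation), where the tokenizer's 0/1 mask is consumed as a boolean mask:

```python
import math
import torch

def toy_attention(q, k, v, attention_mask):
    """Scaled dot-product attention that consumes a tokenizer-style
    0/1 mask (1 = attend, 0 = padding) as a boolean mask."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    # Interpret 0 as "masked out": fill those positions with -inf so they
    # receive zero weight after the softmax.
    scores = scores.masked_fill(attention_mask[:, None, :] == 0, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

q = torch.randn(2, 4, 8)                   # (batch, query_len, head_dim)
k = v = torch.randn(2, 6, 8)               # (batch, key_len, head_dim)
mask = torch.tensor([[1, 1, 1, 1, 0, 0],   # last two keys of sample 0 are padding
                     [1, 1, 1, 0, 0, 0]])  # last three keys of sample 1 are padding
print(toy_attention(q, k, v, mask).shape)  # torch.Size([2, 4, 8])
```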
Hi,

the attention_mask is first preprocessed by a function called get_extended_attention_mask, as can be seen here. This function is defined in modeling_utils.py. It turns the attention_mask into a tensor that is 0 for positions we want to attend to and -10000 for positions we don't want to attend to, as seen here.
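For reference, the conversion amounts to roughly the following (a simplified sketch of what get_extended_attention_mask did around v4.10; more recent versions use the dtype's minimum value rather than the hard-coded -10000):

```python
import torch

# Tokenizer output for a batch of two sequences padded to length 5.
attention_mask = torch.tensor([[1, 1, 1, 0, 0],
                               [1, 1, 1, 1, 1]])

# Broadcast to (batch, 1, 1, seq_len) so it can be added to attention
# scores of shape (batch, num_heads, seq_len, seq_len).
extended_attention_mask = attention_mask[:, None, None, :].to(torch.float32)

# 1 -> 0.0 (keep), 0 -> -10000.0 (effectively removed by the softmax).
extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0
print(extended_attention_mask)
```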
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.