Mismatch between attention mask implementations in transformers and tokenizers
Environment info

- transformers version: 4.10.2
- Models: BERT and other Transformer models
- Library: tokenizers (@LysandreJik)
Problem and solution:
In the current implementation of transformers, the attention mask is added to the attention scores before the softmax, for instance here: https://github.com/huggingface/transformers/blob/3ab0185b061baae207efed02799dd424ee8377f1/src/transformers/models/bert/modeling_bert.py#L326
When this kind of additive mask is used before the softmax, the masking values should be -inf (or a very large negative number). However, the tokenizers of the transformers library output 0 for padding positions and 1 for all other positions in the attention mask: https://huggingface.co/transformers/glossary.html#attention-mask Adding such a 0/1 mask directly to the scores has an unpredictable effect on the final attention values after the softmax; depending on the magnitude of the attention scores, the mask may have almost no effect at all.
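To make the mismatch concrete, here is a small self-contained PyTorch sketch (toy numbers, not library code) comparing what happens when the raw 0/1 mask is added to the scores versus when masked positions are set to -inf:

```python
import torch

# Toy attention scores for one query over three keys; the last key is padding.
scores = torch.tensor([[4.0, 3.0, 5.0]])
# Tokenizer-style mask: 1 = real token, 0 = padding.
attention_mask = torch.tensor([[1.0, 1.0, 0.0]])

# Adding the raw 0/1 mask barely shifts the scores, so the padding position
# still receives a large share of the attention (~0.42 here).
print(torch.softmax(scores + attention_mask, dim=-1))

# Setting masked positions to -inf (or a very large negative number) before
# the softmax drives the padding position's weight to zero.
print(torch.softmax(scores.masked_fill(attention_mask == 0, float("-inf")), dim=-1))
```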
An alternative implementation is to apply the attention mask to the attention scores multiplicatively or as a boolean (logical_and-style) mask (PyTorch, TensorFlow, JAX). With that convention it is fine to assign the value 0 to padding positions in the attention mask.
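A minimal sketch of that convention (a toy scaled dot-product attention written for illustration, not the transformers implementation), where the tokenizer's 0/1 mask is consumed as a boolean mask:

```python
import math
import torch

def toy_attention(q, k, v, attention_mask):
    """Scaled dot-product attention that consumes a tokenizer-style
    0/1 mask (1 = attend, 0 = padding) as a boolean mask."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    # Interpret 0 as "masked out": fill those positions with -inf so they
    # receive zero weight after the softmax.
    scores = scores.masked_fill(attention_mask[:, None, :] == 0, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

q = torch.randn(2, 4, 8)                   # (batch, query_len, head_dim)
k = v = torch.randn(2, 6, 8)               # (batch, key_len, head_dim)
mask = torch.tensor([[1, 1, 1, 1, 0, 0],   # last two keys of sample 0 are padding
                     [1, 1, 1, 0, 0, 0]])  # last three keys of sample 1 are padding
print(toy_attention(q, k, v, mask).shape)  # torch.Size([2, 4, 8])
```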
Hi,

the attention_mask is first preprocessed by a function called get_extended_attention_mask, as can be seen here. This function is defined in modeling_utils.py. It turns the attention_mask into a tensor that is 0 for positions we want to attend to and -10000 for positions we don't want to attend to, as seen here.
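For reference, the conversion amounts to roughly the following (a simplified sketch of what get_extended_attention_mask did around v4.10; more recent versions use the dtype's minimum value rather than the hard-coded -10000):

```python
import torch

# Tokenizer output for a batch of two sequences padded to length 5.
attention_mask = torch.tensor([[1, 1, 1, 0, 0],
                               [1, 1, 1, 1, 1]])

# Broadcast to (batch, 1, 1, seq_len) so it can be added to attention
# scores of shape (batch, num_heads, seq_len, seq_len).
extended_attention_mask = attention_mask[:, None, None, :].to(torch.float32)

# 1 -> 0.0 (keep), 0 -> -10000.0 (effectively removed by the softmax).
extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0
print(extended_attention_mask)
```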
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.