Mismatch of implementations of attention mask in transformers and tokenizers


Environment info

  • transformers version: 4.10.2

Models: BERT and other Transformer models

Problem and solution:

In the current implementation of transformers, the attention mask is added to the attention scores before softmax. For instance: https://github.com/huggingface/transformers/blob/3ab0185b061baae207efed02799dd424ee8377f1/src/transformers/models/bert/modeling_bert.py#L326

When this kind of additive mask is used before the softmax, the masking values should be -inf. However, the tokenizers of the transformers library output 0 for padding positions and 1 for all other positions in the attention mask: https://huggingface.co/transformers/glossary.html#attention-mask. If those 0/1 values were added to the scores directly, the effect on the final attention weights after the softmax would be very unpredictable; depending on the magnitude of the attention scores, the mask might have no effect at all.
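
For illustration only (this is not code from the library), a minimal PyTorch sketch with made-up scores contrasting what happens if the raw 0/1 mask were added to the scores versus an additive mask built from -inf values:

```python
# Illustration only, not transformers code: effect of adding the raw 0/1
# tokenizer mask to the scores vs. adding an -inf-style additive mask.
import torch

scores = torch.tensor([[4.0, 3.0, 2.0]])   # made-up pre-softmax attention scores
pad_mask = torch.tensor([[1, 1, 0]])       # tokenizer output: 1 = real token, 0 = padding

# Adding the 0/1 mask directly only shifts the logits by 1, so the padding
# position still receives attention weight.
naive = torch.softmax(scores + pad_mask, dim=-1)

# Adding -inf at padding positions removes them from the softmax entirely.
additive = torch.zeros_like(scores).masked_fill(pad_mask == 0, float("-inf"))
proper = torch.softmax(scores + additive, dim=-1)

print(naive)   # tensor([[0.7054, 0.2595, 0.0351]]) -> padding still attended to
print(proper)  # tensor([[0.7311, 0.2689, 0.0000]]) -> padding ignored
```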

An alternative implementation is to apply the attention mask to the attention scores multiplicatively, or as a logical/boolean mask (as in PyTorch, TensorFlow, and JAX). With that approach it is fine to assign the value 0 to padding positions in the attention mask, as in the sketch below.
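
A minimal sketch of that boolean-mask style, assuming a (batch, query_len, key_len) score tensor and the tokenizer's 1/0 mask over keys (the function name masked_attention is hypothetical, for illustration only):

```python
# Sketch of the logical-mask style the reporter suggests: keep the tokenizer's
# 0/1 mask as a boolean "may attend" mask instead of adding it to the scores.
import torch

def masked_attention(scores: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """scores: (batch, query_len, key_len); attention_mask: (batch, key_len), 1 = keep, 0 = pad."""
    keep = attention_mask.bool().unsqueeze(1)            # (batch, 1, key_len), broadcast over queries
    scores = scores.masked_fill(~keep, float("-inf"))    # drop padded keys before the softmax
    return torch.softmax(scores, dim=-1)

scores = torch.randn(2, 4, 5)
attention_mask = torch.tensor([[1, 1, 1, 0, 0],
                               [1, 1, 1, 1, 0]])
weights = masked_attention(scores, attention_mask)
print(weights.sum(dim=-1))  # each row still sums to 1; padded keys get zero weight
```

In newer PyTorch versions, torch.nn.functional.scaled_dot_product_attention accepts a boolean attn_mask in the same spirit, where True marks positions that may be attended to.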

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

1 reaction
NielsRogge commented, Sep 14, 2021

Hi,

the attention_mask is first preprocessed by a function called get_extended_attention_mask, which is defined in modeling_utils.py. This function turns the attention_mask into a tensor that is 0 for positions we want to attend to and -10,000 for positions we don't want to attend to.
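
As a sketch of that transformation (the actual implementation lives in modeling_utils.py; newer releases use the dtype's minimum value rather than a hard-coded -10,000):

```python
# Sketch of what the maintainer describes: the tokenizer's 1/0 mask is turned
# into an additive mask of 0 / -10000 before being added to the attention scores.
import torch

attention_mask = torch.tensor([[1, 1, 1, 0, 0]])        # from the tokenizer
extended = attention_mask[:, None, None, :].float()     # (batch, 1, 1, seq_len), broadcastable over heads
extended = (1.0 - extended) * -10000.0                  # 0 where we attend, -10000 where we don't

print(extended)
# tensor([[[[     0.,      0.,      0., -10000., -10000.]]]])
```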

0 reactions
github-actions[bot] commented, Oct 14, 2021

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

