CLIPTextModel gives invalid output for zeroed attention mask
System Info
- transformers version: 4.21.2
- Platform: Linux-5.15.0-46-generic-x86_64-with-glibc2.33
- Python version: 3.8.6
- Huggingface_hub version: 0.5.1
- PyTorch version (GPU?): 1.12.0+cu102 (False)
- Tensorflow version (GPU?): 2.6.0 (False)
- Flax version (CPU?/GPU?/TPU?): 0.4.2 (cpu)
- Jax version: 0.3.10
- JaxLib version: 0.3.10
- Using GPU in script?: no
- Using distributed or parallel set-up in script?: no
Who can help?
Reproduction
```python
from transformers import CLIPTextModel
import torch

model = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
inputs = {
    "input_ids": torch.tensor([[49406, 320, 1125, 539, 320, 1929, 49407]]),
    "attention_mask": torch.tensor([[0, 0, 0, 0, 0, 0, 0]]),
}
outputs = model(**inputs)
```
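One way to observe this (a minimal check, reusing the model and inputs above) is to request the attention weights. With a fully zeroed attention mask one would expect every attention row to be uniform, but the rows instead reproduce the causal pattern:

```python
# Inspect the attention weights of the first layer, first head.
# Expected with an all-zero mask: uniform rows; observed: causal
# (lower-triangular) pattern.
outputs = model(**inputs, output_attentions=True)
print(outputs.attentions[0][0, 0])
```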
Given the zeroed attention mask, the attention weights here should all be equal.
However, the causal mask and the attention mask are added to the logits separately (here), so before going through the softmax some positions are masked twice as strongly as others (to be precise, some values equal the minimum float value and others are -inf). Consequently, the softmax outputs probabilities that match the causal mask instead of a uniform distribution.
This is also the case for TFCLIPTextModel.
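A standalone sketch of the arithmetic (assuming both masks are made additive with `torch.finfo(dtype).min`, as in the Transformers implementation; shapes simplified to a single head):

```python
import torch

min_val = torch.finfo(torch.float32).min
seq_len = 4

# Causal mask in additive form: min_val above the diagonal, 0 elsewhere.
causal = torch.triu(torch.full((seq_len, seq_len), min_val), diagonal=1)
# Fully zeroed attention mask in additive form: min_val everywhere.
padding = torch.full((seq_len, seq_len), min_val)

# Adding the two masks: positions above the diagonal overflow to -inf,
# the rest sit at min_val, so the rows are no longer uniformly masked.
logits = causal + padding
print(torch.softmax(logits, dim=-1))
# Rows follow the causal pattern (e.g. [1, 0, 0, 0], [0.5, 0.5, 0, 0], ...)
# rather than being uniform.
```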
Top GitHub Comments
So I think we can close this 😃
The additive mask before the softmax is a trick that works under the assumption that the mask contains at least a single 1. So from what I understand, for a fully zeroed mask the output of most models is simply not well defined, and that is fine given the use cases so far.
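(For context, a minimal sketch of that additive-mask trick; the helper below is illustrative, not the exact Transformers code:)

```python
import torch

def to_additive(mask, dtype=torch.float32):
    # Usual pattern: 1 -> 0.0 (attend), 0 -> large negative (ignore).
    return (1.0 - mask.to(dtype)) * torch.finfo(dtype).min

logits = torch.tensor([2.0, -1.0, 0.5])

mask = torch.tensor([1, 1, 0])
print(torch.softmax(logits + to_additive(mask), dim=-1))
# -> masked position gets ~0 probability, as intended

all_zero = torch.tensor([0, 0, 0])
print(torch.softmax(logits + to_additive(all_zero), dim=-1))
# -> every position is equally masked, so the softmax degenerates to a
#    uniform distribution regardless of the logits: not well defined
```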
Thanks a lot for the issue @jonatanklosko!
This indeed seems a bit strange; I see two solutions here
However, is it really a bug? As long as the values for masked positions are much lower than those for non-masked tokens, the masked tokens will still be ignored. Do you have an example where the masked positions are not ignored?
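(A quick numeric check of that claim, with illustrative values: whether a masked logit sits at the float minimum or at -inf, its softmax weight is effectively zero as long as at least one position is unmasked:)

```python
import torch

min_val = torch.finfo(torch.float32).min
for masked in (min_val, float("-inf")):        # "-min_float" vs. -inf
    logits = torch.tensor([2.0, -1.0, masked])  # last position is masked
    print(torch.softmax(logits, dim=-1))        # masked weight ~0 either way
```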
@seanmor5 good point!
I'm not sure this impacts the quality of diffusers; for example, as discussed in this issue, we have verified that the results are 1:1 with the original repo.
Yes, it was trained using CLIPTextModel, but neither this training nor the actual pre-trained CLIP model ever used the attention mask. They always pad the sequence to max_len 77 and rely on the causal mask. This is how we recommend using the Stable Diffusion model. I know this is not ideal, but that's how it was trained; cf. https://github.com/CompVis/stable-diffusion/blob/main/ldm/modules/encoders/modules.py#L155

Also cc @patrickvonplaten, wdyt?
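For reference, a sketch of the recommended usage described above (pad to 77 tokens, do not pass the attention mask; the checkpoint name and padding settings here are assumptions based on the repro and the linked Stable Diffusion code, which uses a larger CLIP variant):

```python
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
model = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

batch = tokenizer(
    ["a photo of a dog"],
    padding="max_length", max_length=77, truncation=True,
    return_tensors="pt",
)

# The attention mask is deliberately not passed: the model then relies only
# on its internal causal mask, matching how CLIP was trained.
outputs = model(input_ids=batch["input_ids"])
text_embeddings = outputs.last_hidden_state  # shape: (1, 77, hidden_size)
```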