
CLIPTextModel gives invalid output for zeroed attention mask


System Info

  • transformers version: 4.21.2
  • Platform: Linux-5.15.0-46-generic-x86_64-with-glibc2.33
  • Python version: 3.8.6
  • Huggingface_hub version: 0.5.1
  • PyTorch version (GPU?): 1.12.0+cu102 (False)
  • Tensorflow version (GPU?): 2.6.0 (False)
  • Flax version (CPU?/GPU?/TPU?): 0.4.2 (cpu)
  • Jax version: 0.3.10
  • JaxLib version: 0.3.10
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help?

@patil-suraj

Reproduction

from transformers import CLIPTextModel
import torch

model = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

inputs = {
  "input_ids": torch.tensor([[49406, 320, 1125, 539, 320, 1929, 49407]]),
  "attention_mask": torch.tensor([[0, 0, 0, 0, 0, 0, 0]])
}

outputs = model(**inputs)
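
As an additional check (my addition, not part of the original report), the attention probabilities can be inspected directly via output_attentions; with the all-zero mask they still follow the causal mask instead of being uniform over all positions:

# Hedged sanity check: first layer, first head. Despite every position being
# masked out, each row is essentially uniform over the causally visible
# positions only, i.e. it matches the lower-triangular causal pattern.
outputs_with_attn = model(**inputs, output_attentions=True)
print(outputs_with_attn.attentions[0][0, 0])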

Given the zeroed attention mask, the attention weights should all be equal here:

https://github.com/huggingface/transformers/blob/21f6f58721dd9154357576be6de54eefef1f1818/src/transformers/models/clip/modeling_clip.py#L246

However, the causal mask and the attention mask are added separately (here), so in this case the pre-softmax values are not uniformly masked: positions masked only once end up at roughly min_float, while positions masked by both overflow to -inf. Consequently, the softmax outputs probabilities that follow the causal mask alone.
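
To make the arithmetic concrete, here is a minimal standalone sketch (mine, not the actual modeling code) of one row of pre-softmax scores once both masks have been added:

import torch

# Minimal sketch, assuming float32: with an all-zero attention mask every key
# position already gets torch.finfo(torch.float32).min; positions additionally
# hidden by the causal mask receive it a second time and overflow to -inf.
neg = torch.finfo(torch.float32).min
padding_mask = torch.full((4,), neg)              # expanded all-zero attention mask
causal_mask = torch.tensor([0.0, 0.0, neg, neg])  # e.g. the row for query position 1 of 4
scores = torch.zeros(4) + causal_mask + padding_mask
print(scores)                                     # [min, min, -inf, -inf]
print(torch.softmax(scores, dim=-1))              # [0.5, 0.5, 0., 0.] -- follows the causal mask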

This is also the case for TFCLIPTextModel.

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 9 (7 by maintainers)

Top GitHub Comments

1 reaction
jonatanklosko commented, Sep 2, 2022

So I think we can close this 😃

The additive mask before softmax is a trick that works under the assumption that the mask has at least a single 1. So from what I understand, for a zeroed mask the output of most models is simply not well defined, and that’s fine given the use cases so far.
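
For completeness, a tiny sketch of that point (my illustration, not from the thread): as soon as at least one position is kept, the kept logit dominates the softmax and the overflow in doubly masked positions no longer matters:

import torch

# One kept key, one masked key, and one doubly masked key (which would have
# overflowed to -inf); the kept position still gets essentially all the mass.
neg = torch.finfo(torch.float32).min
scores = torch.tensor([0.3, neg, float("-inf")])
print(torch.softmax(scores, dim=-1))  # ~[1., 0., 0.]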

1 reaction
patil-suraj commented, Aug 29, 2022

Thanks a lot for the issue @jonatanklosko !

This indeed seems a bit strange. I see two solutions here:

  • We could combine causal mask and attention mask and see what happens
  • Or instead of using additive masks, we could replace the masked values with large negative numbers (see the rough sketch just below).
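
A purely illustrative sketch of what that second option could look like (my own helper name, not an actual change in transformers): combine the two masks as booleans and apply the large negative value once with masked_fill, so overlapping masked positions cannot overflow:

import torch

# Hypothetical helper, not the transformers implementation.
def masked_attention_probs(scores, causal_keep, attention_keep):
    # causal_keep / attention_keep: boolean tensors, True = key may be attended to
    keep = causal_keep & attention_keep
    scores = scores.masked_fill(~keep, torch.finfo(scores.dtype).min)
    return torch.softmax(scores, dim=-1)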

However, is it really a bug? As long as the values for masked positions are much lower than those for non-masked positions, the masked tokens will still be ignored. Do you have an example where the masked positions are not ignored?

@seanmor5 good point!

> I’ve also experienced this issue, and anecdotally this inconsistency seems to impact the quality of stable diffusion outputs from https://github.com/huggingface/diffusers

I’m not sure this impacts the quality of diffusers; for example, as discussed in this issue, we have verified that the results are 1:1 with the original repo.

> I am assuming Stable Diffusion was trained using the PT CLIPTextModel, and thus results rely on this inconsistent/invalid text embedding?

Yes, it was trained using CLIPTextModel, but neither this training nor the actual pre-trained CLIP model ever used an attention mask. They always pad the sequence to max_len 77 and rely on the causal mask alone, and this is how we recommend using the stable diffusion model. I know this is not ideal, but that’s how it was trained. cf. https://github.com/CompVis/stable-diffusion/blob/main/ldm/modules/encoders/modules.py#L155
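
For reference, a minimal sketch of that recommended usage (my own snippet, reusing the checkpoint from the reproduction above rather than the one Stable Diffusion actually ships with): pad to max_length 77 and do not pass an attention mask at all:

from transformers import CLIPTokenizer, CLIPTextModel
import torch

# Rely on padding to 77 plus the built-in causal mask; no attention_mask is
# passed to the text encoder.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
model = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

tokens = tokenizer("a photo of a dog", padding="max_length", max_length=77, return_tensors="pt")
with torch.no_grad():
    text_embeddings = model(input_ids=tokens.input_ids).last_hidden_state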

Also cc @patrickvonplaten, wdyt?
