CLIPTextModel gives invalid output for zeroed attention mask
System Info
- transformers version: 4.21.2
- Platform: Linux-5.15.0-46-generic-x86_64-with-glibc2.33
- Python version: 3.8.6
- Huggingface_hub version: 0.5.1
- PyTorch version (GPU?): 1.12.0+cu102 (False)
- Tensorflow version (GPU?): 2.6.0 (False)
- Flax version (CPU?/GPU?/TPU?): 0.4.2 (cpu)
- Jax version: 0.3.10
- JaxLib version: 0.3.10
- Using GPU in script?: no
- Using distributed or parallel set-up in script?: no
Who can help?
Reproduction
```python
from transformers import CLIPTextModel
import torch

model = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
inputs = {
    "input_ids": torch.tensor([[49406, 320, 1125, 539, 320, 1929, 49407]]),
    "attention_mask": torch.tensor([[0, 0, 0, 0, 0, 0, 0]]),
}
outputs = model(**inputs)
```
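One way to observe this (a minimal check, reusing the model and inputs above) is to request the attention weights. With a fully zeroed attention mask one would expect every attention row to be uniform, but the rows instead reproduce the causal pattern:

```python
# Inspect the attention weights of the first layer, first head.
# Expected with an all-zero mask: uniform rows; observed: causal
# (lower-triangular) pattern.
outputs = model(**inputs, output_attentions=True)
print(outputs.attentions[0][0, 0])
```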
Given the zeroed attention mask, the attention weights here should all be equal.
However, the causal mask and the attention mask are added to the logits separately (here), so before going through the softmax some positions are masked twice as strongly as others (to be precise, some values equal the minimum float value and others are -inf). Consequently, the softmax outputs probabilities that match the causal mask instead of a uniform distribution.
This is also the case for TFCLIPTextModel.
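A standalone sketch of the arithmetic (assuming both masks are made additive with `torch.finfo(dtype).min`, as in the Transformers implementation; shapes simplified to a single head):

```python
import torch

min_val = torch.finfo(torch.float32).min
seq_len = 4

# Causal mask in additive form: min_val above the diagonal, 0 elsewhere.
causal = torch.triu(torch.full((seq_len, seq_len), min_val), diagonal=1)
# Fully zeroed attention mask in additive form: min_val everywhere.
padding = torch.full((seq_len, seq_len), min_val)

# Adding the two masks: positions above the diagonal overflow to -inf,
# the rest sit at min_val, so the rows are no longer uniformly masked.
logits = causal + padding
print(torch.softmax(logits, dim=-1))
# Rows follow the causal pattern (e.g. [1, 0, 0, 0], [0.5, 0.5, 0, 0], ...)
# rather than being uniform.
```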
Top GitHub Comments
So I think we can close this 😃
The additive mask before the softmax is a trick that works under the assumption that the mask contains at least a single 1. So from what I understand, for a fully zeroed mask the output of most models is simply not well defined, and that is fine given the use cases so far.
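(For context, a minimal sketch of that additive-mask trick; the helper below is illustrative, not the exact Transformers code:)

```python
import torch

def to_additive(mask, dtype=torch.float32):
    # Usual pattern: 1 -> 0.0 (attend), 0 -> large negative (ignore).
    return (1.0 - mask.to(dtype)) * torch.finfo(dtype).min

logits = torch.tensor([2.0, -1.0, 0.5])

mask = torch.tensor([1, 1, 0])
print(torch.softmax(logits + to_additive(mask), dim=-1))
# -> masked position gets ~0 probability, as intended

all_zero = torch.tensor([0, 0, 0])
print(torch.softmax(logits + to_additive(all_zero), dim=-1))
# -> every position is equally masked, so the softmax degenerates to a
#    uniform distribution regardless of the logits: not well defined
```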
Thanks a lot for the issue @jonatanklosko!
This indeed seems a bit strange; I see two solutions here
However, is it really a bug? As long as the values for masked positions are much lower than those for non-masked tokens, the masked tokens will still be ignored. Do you have an example where the masked positions are not ignored?
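(A quick numeric check of that claim, with illustrative values: whether a masked logit sits at the float minimum or at -inf, its softmax weight is effectively zero as long as at least one position is unmasked:)

```python
import torch

min_val = torch.finfo(torch.float32).min
for masked in (min_val, float("-inf")):        # "-min_float" vs. -inf
    logits = torch.tensor([2.0, -1.0, masked])  # last position is masked
    print(torch.softmax(logits, dim=-1))        # masked weight ~0 either way
```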
@seanmor5 good point!
I'm not sure this impacts the quality of diffusers; for example, as discussed in this issue, we have verified that the results are 1:1 with the original repo.
Yes, it was trained using CLIPTextModel, but neither this training nor the actual pre-trained CLIP model ever used the attention mask. They always pad the sequence to max_len 77 and rely on the causal mask. This is how we recommend using the Stable Diffusion model. I know this is not ideal, but that's how it was trained; cf. https://github.com/CompVis/stable-diffusion/blob/main/ldm/modules/encoders/modules.py#L155

Also cc @patrickvonplaten, wdyt?
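For reference, a sketch of the recommended usage described above (pad to 77 tokens, do not pass the attention mask; the checkpoint name and padding settings here are assumptions based on the repro and the linked Stable Diffusion code, which uses a larger CLIP variant):

```python
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
model = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

batch = tokenizer(
    ["a photo of a dog"],
    padding="max_length", max_length=77, truncation=True,
    return_tensors="pt",
)

# The attention mask is deliberately not passed: the model then relies only
# on its internal causal mask, matching how CLIP was trained.
outputs = model(input_ids=batch["input_ids"])
text_embeddings = outputs.last_hidden_state  # shape: (1, 77, hidden_size)
```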