Auto-normalization and CLIP embeddings
I have an x-clip model trained on image data normalized to the range `[-1, 1]`, with good loss performance on a custom dataset.

I've tried to re-use the same custom dataset/dataloader, which provides normalized images, with the models in this repo, but have run into issues with the auto-normalization.
If I leave the images in the range `[-1, 1]`, the `forward` pass in the Decoder auto-generates the image embeddings from the CLIP adapter and passes the images as-is, which works. However, `p_losses` then applies `normalize_neg_one_to_one`, which shifts the range to `[-3, 1]`.

If I leave the images in the range `[0, 1]`, the `forward` pass in the Decoder passes the wrong image data into the CLIP adapter, while the `p_losses` pass is correct with `[-1, 1]`.
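To illustrate the range shift, here is a toy sketch (assuming `normalize_neg_one_to_one` is the usual `img * 2 - 1` mapping from `[0, 1]` to `[-1, 1]`; plain floats stand in for tensors):

```python
def normalize_neg_one_to_one(img):
    # assumed implementation: the usual [0, 1] -> [-1, 1] mapping
    return img * 2 - 1

# data that is already in [-1, 1], as produced by my dataloader
already_normalized = [-1.0, 0.0, 1.0]

# applying the mapping a second time shifts the range to [-3, 1]
shifted = [normalize_neg_one_to_one(v) for v in already_normalized]
print(shifted)  # [-3.0, -1.0, 1.0]
```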
As a workaround I can compute `image_embed`/`text_embed` manually and pass them into the Decoder, but this costs extra memory or compute, since the normalized results have to be discarded afterwards and recomputed.

I think a control to disable auto-normalization would be beneficial here, though maybe the real error is training CLIP itself on normalized image data.
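The kind of control I mean would look roughly like this — a hypothetical `auto_normalize_img` flag on a toy decoder (this flag does not exist in the repo; it's just a sketch of the request):

```python
def normalize_neg_one_to_one(img):
    # assumed [0, 1] -> [-1, 1] mapping
    return img * 2 - 1

class ToyDecoder:
    def __init__(self, auto_normalize_img=True):
        # hypothetical flag: set False when the dataloader
        # already provides images in [-1, 1]
        self.auto_normalize_img = auto_normalize_img

    def p_losses(self, img):
        if self.auto_normalize_img:
            img = [normalize_neg_one_to_one(x) for x in img]
        return img  # stand-in for the actual diffusion loss computation

# pre-normalized data passes through unchanged
out = ToyDecoder(auto_normalize_img=False).p_losses([-1.0, 1.0])
print(out)  # [-1.0, 1.0]
```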
Issue Analytics
- Created: a year ago
- Reactions: 1
- Comments: 6 (4 by maintainers)
Top GitHub Comments
@marunine great! if you get anything working with x-clip as the base, do send me an email and let me know 😄
Thanks for the quick response!

I think the only issue with a `normalization_fn` on the XClipAdapter is that it implies the image data would need to be normalized. I'm not sure if the model is more numerically stable or trains better in `[-1, 1]` vs `[0, 1]`, but anecdotally `[-1, 1]` worked for me.
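For what it's worth, a `normalization_fn` hook could default to identity, so callers with pre-normalized data simply opt out. A hypothetical sketch (not the actual XClipAdapter API — `ToyClipAdapter` and `embed_image` are stand-ins):

```python
class ToyClipAdapter:
    def __init__(self, normalization_fn=None):
        # hypothetical hook: None means "pass images through unchanged",
        # so callers with pre-normalized [-1, 1] data can opt out
        self.normalization_fn = normalization_fn or (lambda img: img)

    def embed_image(self, img):
        img = self.normalization_fn(img)
        return img  # stand-in for the real CLIP image encoder

# caller with [0, 1] data supplies an explicit mapping...
rescale = lambda xs: [x * 2 - 1 for x in xs]
embed = ToyClipAdapter(normalization_fn=rescale).embed_image([0.0, 1.0])
print(embed)  # [-1.0, 1.0]

# ...while pre-normalized data goes through untouched by default
passthrough = ToyClipAdapter().embed_image([-1.0, 1.0])
print(passthrough)  # [-1.0, 1.0]
```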