Auto-normalization and CLIP embeddings

I have an x-clip model trained on normalized image data in the range [-1, 1] with good loss performance on a custom dataset.

I’ve tried to re-use the same custom dataset/dataloader that provides normalized images with the models in this repo but have run into issues with the auto-normalization.

If I leave the images in the range [-1, 1], the Decoder's forward pass auto-generates the image embeddings from the CLIP adapter and passes the images through as-is, which works. However, the p_losses pass then applies normalize_neg_one_to_one, which shifts the range to [-3, 1].

If I instead leave the images in the range [0, 1], the Decoder's forward pass feeds the un-shifted [0, 1] data into the CLIP adapter (which was trained on [-1, 1] images), while the p_losses pass correctly normalizes them to [-1, 1].
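For reference, the auto-normalization in question is the usual `x * 2 - 1` mapping. A minimal sketch below shows how it is typically defined in lucidrains-style repos (the exact helper in this repo may differ in detail) and makes the double-normalization easy to see:

```python
import torch

def normalize_neg_one_to_one(img):
    # maps [0, 1] -> [-1, 1]; sketch of the usual definition, may differ from the repo
    return img * 2 - 1

already_normalized = torch.tensor([-1.0, 0.0, 1.0])   # images already in [-1, 1]
unit_range         = torch.tensor([0.0, 0.5, 1.0])    # images in [0, 1]

print(normalize_neg_one_to_one(already_normalized))   # tensor([-3., -1.,  1.])  -> shifted to [-3, 1]
print(normalize_neg_one_to_one(unit_range))           # tensor([-1.,  0.,  1.])  -> correct [-1, 1]
```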

As a workaround I can compute image_embed/text_embed manually and pass them into the Decoder, but this costs extra memory or compute, since the normalized results then have to be discarded and recomputed.
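A minimal sketch of that workaround, assuming the adapter exposes an `embed_image` method and the Decoder accepts a precomputed `image_embed`; the names follow the spirit of the repo but are not guaranteed to match its exact signatures:

```python
import torch

# Assumed setup (not shown): `clip` is the adapter wrapping the x-clip model trained
# on [-1, 1] images, and `decoder` is the Decoder from this repo.

images_unit = torch.rand(4, 3, 256, 256)   # dataloader output scaled to [0, 1]
images_clip = images_unit * 2 - 1          # shift to [-1, 1] for the CLIP adapter

with torch.no_grad():
    # embed on the range CLIP was trained with (assumes embed_image returns the
    # embedding first, as in the x-clip-style adapters)
    image_embed, *_ = clip.embed_image(images_clip)

# Passing a precomputed image_embed lets the Decoder skip its own CLIP call;
# its internal normalize_neg_one_to_one then maps [0, 1] -> [-1, 1] exactly once.
loss = decoder(images_unit, image_embed = image_embed)
loss.backward()
```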

I think a control to disable the auto-normalization would be beneficial here, though perhaps this is more a problem of having trained CLIP itself on normalized image data.

Issue Analytics

  • State: closed
  • Created: a year ago
  • Reactions: 1
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

1 reaction
lucidrains commented, May 19, 2022

@marunine great! if you get anything working with x-clip as the base, do send me an email and let me know 😄

1 reaction
marunine commented, May 19, 2022

Thanks for the quick response!

I think the only issue with adding a normalization_fn to the XClipAdapter is that it implies the image data would need to be normalized in the first place. I'm not sure whether the model is more numerically stable or trains better on [-1, 1] vs. [0, 1], but anecdotally it worked for me.
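For illustration, one shape such a hook could take (purely hypothetical; neither `NormalizingClipWrapper` nor a `normalization_fn` argument exists in the repo as of this thread):

```python
import torch.nn as nn

def identity(x):
    return x

class NormalizingClipWrapper(nn.Module):
    # Hypothetical wrapper: applies a caller-supplied normalization_fn to images
    # before delegating to the wrapped CLIP / x-clip adapter, so the pipeline's
    # image range and the range CLIP was trained on can be decoupled.
    def __init__(self, clip_adapter, normalization_fn = identity):
        super().__init__()
        self.clip = clip_adapter
        self.normalization_fn = normalization_fn

    def embed_image(self, images):
        # e.g. normalization_fn = lambda x: x * 2 - 1 when CLIP expects [-1, 1]
        # but the decoder pipeline feeds [0, 1]; identity when they already match
        return self.clip.embed_image(self.normalization_fn(images))
```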
