Redundant normalisation of image and text features in OWL-ViT

Who can help?

@alaradirik

Issue description

Hi,

Thank you for the codebase! As the title suggests, I think that in modeling_owlvit.py the image and text features are normalised twice, whereas in the original codebase from Google Research they are normalised only once. Specifically, modeling_owlvit.py normalises the image and text features both in lines 1073-1074 and in lines 1145-1146. By contrast, in the original code at https://github.com/google-research/scenic/blob/main/scenic/projects/owl_vit/layers.py the features are normalised only in lines 86-89, while in line 144 the normalisation parameter is set to normalize=False, with a comment that explicitly says Don't normalize image and text embeddings:.

I think this is sensible, as there is no reason for double normalisation, which normally leads to performance degradation. Please let me know what you think, and whether I’m wrong, as I might be missing something.

Issue Analytics

State:
Created a year ago
Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction

ekazakos commented, Oct 11, 2022

Glad I could help! Could you please let me know whether this boosts validation performance at all?

0 reactions

alaradirik commented, Oct 18, 2022

Hey @ekazakos, sorry for the delay! The issue will be fixed with this PR but it doesn’t affect the performance as double normalization yields the same results.
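
The point about double normalization giving the same results can be checked with a minimal sketch. This is not the OWL-ViT code: the helper l2_normalize and the sample vector are hypothetical, and the sketch assumes that nothing rescales the features between the two passes. Under that assumption, L2 normalization is idempotent up to floating-point error, because a vector that already has unit norm is divided by (approximately) 1 the second time.

```python
import math

def l2_normalize(vec, eps=1e-12):
    """Scale vec to unit L2 norm; eps guards against division by zero."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]

# Hypothetical feature vector, chosen only for illustration.
features = [3.0, -4.0, 12.0]

once = l2_normalize(features)   # first normalization
twice = l2_normalize(once)      # redundant second normalization

# Largest element-wise difference between the two results.
max_diff = max(abs(a - b) for a, b in zip(once, twice))
print(f"max difference after second normalization: {max_diff:.2e}")
```

So the second pass is redundant work rather than a source of different outputs, which is consistent with the maintainer's comment above.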
