Auto-normalization and CLIP embeddings
I have an x-clip model trained on image data normalized to the range `[-1, 1]`, with good loss performance on a custom dataset.

I've tried to re-use the same custom dataset/dataloader, which provides normalized images, with the models in this repo, but have run into issues with the auto-normalization.
If I leave the images in the range `[-1, 1]`, the `forward` pass in the Decoder auto-generates the image embeddings from the CLIP adapter and passes the images as-is, which works. However, `p_losses` then applies `normalize_neg_one_to_one`, which shifts the range to `[-3, 1]`.

If I leave the images in the range `[0, 1]`, the `forward` pass in the Decoder passes the wrong image data into the CLIP adapter, while the `p_losses` pass is correct with `[-1, 1]`.
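To illustrate the range shift, here is a toy sketch (assuming `normalize_neg_one_to_one` is the usual `img * 2 - 1` mapping from `[0, 1]` to `[-1, 1]`; plain floats stand in for tensors):

```python
def normalize_neg_one_to_one(img):
    # assumed implementation: the usual [0, 1] -> [-1, 1] mapping
    return img * 2 - 1

# data that is already in [-1, 1], as produced by my dataloader
already_normalized = [-1.0, 0.0, 1.0]

# applying the mapping a second time shifts the range to [-3, 1]
shifted = [normalize_neg_one_to_one(v) for v in already_normalized]
print(shifted)  # [-3.0, -1.0, 1.0]
```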
As a workaround I can compute `image_embed`/`text_embed` manually and pass them into the Decoder, but this costs extra memory or compute, since the normalized results have to be discarded afterwards and recomputed.

I think a control to disable auto-normalization would be beneficial here, though maybe the real error is training CLIP itself on normalized image data.
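The kind of control I mean would look roughly like this — a hypothetical `auto_normalize_img` flag on a toy decoder (this flag does not exist in the repo; it's just a sketch of the request):

```python
def normalize_neg_one_to_one(img):
    # assumed [0, 1] -> [-1, 1] mapping
    return img * 2 - 1

class ToyDecoder:
    def __init__(self, auto_normalize_img=True):
        # hypothetical flag: set False when the dataloader
        # already provides images in [-1, 1]
        self.auto_normalize_img = auto_normalize_img

    def p_losses(self, img):
        if self.auto_normalize_img:
            img = [normalize_neg_one_to_one(x) for x in img]
        return img  # stand-in for the actual diffusion loss computation

# pre-normalized data passes through unchanged
out = ToyDecoder(auto_normalize_img=False).p_losses([-1.0, 1.0])
print(out)  # [-1.0, 1.0]
```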
Issue Analytics
- Created: a year ago
- Reactions: 1
- Comments: 6 (4 by maintainers)
Top GitHub Comments
@marunine great! if you get anything working with x-clip as the base, do send me an email and let me know 😄
Thanks for the quick response!

I think the only issue with a `normalization_fn` on the XClipAdapter is that it implies the image data would need to be normalized. I'm not sure if the model is more numerically stable or trains better in `[-1, 1]` vs `[0, 1]`, but anecdotally `[-1, 1]` worked for me.
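For what it's worth, a `normalization_fn` hook could default to identity, so callers with pre-normalized data simply opt out. A hypothetical sketch (not the actual XClipAdapter API — `ToyClipAdapter` and `embed_image` are stand-ins):

```python
class ToyClipAdapter:
    def __init__(self, normalization_fn=None):
        # hypothetical hook: None means "pass images through unchanged",
        # so callers with pre-normalized [-1, 1] data can opt out
        self.normalization_fn = normalization_fn or (lambda img: img)

    def embed_image(self, img):
        img = self.normalization_fn(img)
        return img  # stand-in for the real CLIP image encoder

# caller with [0, 1] data supplies an explicit mapping...
rescale = lambda xs: [x * 2 - 1 for x in xs]
embed = ToyClipAdapter(normalization_fn=rescale).embed_image([0.0, 1.0])
print(embed)  # [-1.0, 1.0]

# ...while pre-normalized data goes through untouched by default
passthrough = ToyClipAdapter().embed_image([-1.0, 1.0])
print(passthrough)  # [-1.0, 1.0]
```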