Using different encoders in CLIP
Hi, I am wondering if it is possible to use different encoders in CLIP, for example a ResNet instead of a ViT for the image side. Is it also possible to replace the text encoder with a feature encoder? If I have a vector of features for a given image and I want to use x-clip, how should I do that? I made a code example that doesn't seem to work; here is what I did:
```python
import torch
import torch.nn as nn
from torchvision import models
from x_clip import CLIP

class Image_Encoder(torch.nn.Module):
    # output size is (bs, 512)
    def __init__(self):
        super(Image_Encoder, self).__init__()
        self.model_pre = models.resnet18(pretrained=False)
        self.base = nn.Sequential(*list(self.model_pre.children()))
        self.base[0] = nn.Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
        self.resnet = self.base[:-1]

    def forward(self, x):
        out = self.resnet(x).squeeze()
        return out

class features_encoder(torch.nn.Module):
    # output size is (bs, 512)
    def __init__(self):
        super(features_encoder, self).__init__()
        self.model = nn.Linear(2048, 512)

    def forward(self, x):
        out = self.model(x)
        return out

images_encoder = Image_Encoder()
features_encoder = features_encoder()

clip = CLIP(
    image_encoder = images_encoder,
    text_encoder = features_encoder,
    dim_image = 512,
    dim_text = 512,
    dim_latent = 512
)

features = torch.randn(4, 2048)
images = torch.randn(4, 3, 256, 256)

loss = clip(features, images, return_loss = True)
loss.backward()
```
But I got the following error: `forward() takes 2 positional arguments but 3 were given`

Thanks
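A plausible explanation for this error (an assumption, since it depends on how x-clip calls the custom encoders internally) is that the text encoder is invoked with an additional argument, such as a token mask, while `features_encoder.forward` only accepts a single input. A minimal sketch of an encoder that tolerates such extra arguments:

```python
import torch
import torch.nn as nn

class FeaturesEncoder(nn.Module):
    """Hypothetical encoder mapping a (batch, 2048) feature vector to (batch, 512)."""

    def __init__(self):
        super().__init__()
        self.model = nn.Linear(2048, 512)

    def forward(self, x, *args, **kwargs):
        # Accept and ignore any extra positional/keyword arguments
        # (e.g. an attention mask) that the wrapping library may pass.
        return self.model(x)
```

Whether this is the actual signature mismatch would need to be checked against the x-clip source; the wrapper is only meant to illustrate the idea.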
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Works just fine, thanks!
Works just fine, thanks 😃 Although now, with visual ssl set to True, I get the following error:

```
EinopsError: Error while processing rearrange-reduction pattern "b n d -> (b n) d". Input tensor shape: torch.Size([2, 512]). Additional info: {}. Expected 3 dimensions, got 2
```
Sorry about the trouble aha
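The shape in that error suggests (again an assumption about x-clip's internals) that the visual SSL branch expects the image encoder to emit a sequence of patch tokens of shape (batch, num_tokens, dim) rather than a single pooled vector of shape (batch, dim). A minimal sketch of a ResNet-based encoder that returns such a token sequence:

```python
import torch
import torch.nn as nn
from torchvision import models

class ImageTokenEncoder(nn.Module):
    """Hypothetical ResNet-18 encoder returning patch-like tokens
    of shape (batch, num_tokens, 512) instead of a pooled vector."""

    def __init__(self):
        super().__init__()
        resnet = models.resnet18(pretrained=False)
        # keep the convolutional trunk, drop average pooling and the fc head
        self.trunk = nn.Sequential(*list(resnet.children())[:-2])

    def forward(self, x):
        feats = self.trunk(x)                      # (b, 512, h, w)
        tokens = feats.flatten(2).transpose(1, 2)  # (b, h*w, 512)
        return tokens

# quick shape check
encoder = ImageTokenEncoder()
print(encoder(torch.randn(2, 3, 256, 256)).shape)  # torch.Size([2, 64, 512])
```

Whether the SSL path in x-clip then handles an arbitrary token count, or additionally needs a CLS token, would have to be verified against the library source.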