Copy detection
@mathildecaron31 I have a question about copy detection. I am trying to evaluate the pretrained DINO models on a dataset for the copy detection task, following the steps from the paper. Even with different input image sizes, Table 4 reports a final embedding dimension of 1536. I am not able to understand how we can get the same embedding dimension after concatenating the CLS embedding and the GeM-pooled output patch tokens for different input image sizes. Maybe I am missing a point here. Here is what I did:
I added the following method to `VisionTransformer` to return the output patch tokens and the CLS output:
```python
def forward_output_patch_tokens_cls(self, x):
    B = x.shape[0]
    x = self.patch_embed(x)                        # B x num_patches x embed_dim
    cls_tokens = self.cls_token.expand(B, -1, -1)  # prepend the learnable [CLS] token
    x = torch.cat((cls_tokens, x), dim=1)
    pos_embed = self.interpolate_pos_encoding(x, self.pos_embed)
    x = x + pos_embed
    x = self.pos_drop(x)
    for blk in self.blocks:
        x = blk(x)
    if self.norm is not None:
        x = self.norm(x)
    return x                                       # B x (1 + num_patches) x embed_dim
```
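For what it's worth, I think the same tokens can also be obtained with the repo's existing `get_intermediate_layers` helper (my reading of `vision_transformer.py`; it already applies the final norm):

```python
# Assumed equivalent using DINO's built-in helper: returns the normalized
# token sequence from the last n blocks, here just the final one.
tokens = model.get_intermediate_layers(x, n=1)[0]  # B x (1 + num_patches) x embed_dim
```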
Using the GeM module from here:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def gem(x, p=3, eps=1e-6):
    "x: BS x embed_dim x num_tokens (pools across the token positions)"
    return F.avg_pool1d(x.clamp(min=eps).pow(p), x.size(-1)).pow(1. / p)

class GeM(nn.Module):
    def __init__(self, p=3, eps=1e-6):
        super(GeM, self).__init__()
        self.p = nn.Parameter(torch.ones(1) * p)  # learnable pooling exponent
        self.eps = eps

    def forward(self, x):
        return gem(x, p=self.p, eps=self.eps)

    def __repr__(self):
        return self.__class__.__name__ + '(p={:.4f}, eps={})'.format(self.p.data.tolist()[0], self.eps)
```
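A quick shape sanity check for the pooling direction (shapes assumed for dino_vitb8 at 224×224):

```python
gem_pooling = GeM()
tokens = torch.randn(2, 768, 784)  # B x embed_dim x num_tokens (after the permute below)
pooled = gem_pooling(tokens)       # avg_pool1d collapses the 784 token positions
assert pooled.shape == (2, 768, 1)
```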
Collect the embeddings (CLS + GeM-pooled output patch tokens):
```python
all_image_features = []
with torch.no_grad():
    for imgb in progress_bar(image_dl):  # progress_bar: e.g. from fastprogress
        outputs = model.forward_output_patch_tokens_cls(imgb.cuda())
        cls_token, output_patch_tokens = outputs[:, 0], outputs[:, 1:]
        cls_features = cls_token
        # permute to B x embed_dim x num_tokens so GeM pools across tokens
        patch_features = gem_pooling(output_patch_tokens.permute(0, 2, 1)).squeeze(-1)
        concat_features = torch.cat([cls_features, patch_features], dim=-1)
        all_image_features.append(concat_features.cpu())
```
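(Not from the paper, just a minimal sketch of how I would score these embeddings for copy detection; `query_features` and `ref_features` are hypothetical names for the query and reference splits collected as above:)

```python
q = F.normalize(torch.cat(query_features), dim=-1)  # Nq x 1536, L2-normalized
r = F.normalize(torch.cat(ref_features), dim=-1)    # Nr x 1536
sims = q @ r.t()                                    # cosine similarity matrix
nearest = sims.argmax(dim=-1)                       # best-matching reference per query
```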
Following this, and using an image size of 224 for dino_vitb8, my final embedding dimension is ~~1568~~ 1536, which can also be calculated as cls_feature_dim * 2 = 768 * 2.
Question
Also, during the copy detection task, do you learn the pooling parameter p, or is it picked based on a validation set? I didn't quite understand the whitening part: is it the same as regular unsupervised PCA?
I found this paper: https://hal.inria.fr/hal-00722622v2/document. I believe the idea comes from there.
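My current reading (an assumption, not verified against the official eval code) is that it is PCA whitening fitted on an unlabelled distractor set, as in that paper. A minimal scikit-learn sketch, where `distractor_features` is a hypothetical feature matrix from such a set:

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=512, whiten=True)  # 512 is an assumed output dimension
pca.fit(distractor_features.numpy())      # fit whitening on the distractor set only

feats = torch.cat(all_image_features).numpy()
whitened = torch.from_numpy(pca.transform(feats))
whitened = F.normalize(whitened, dim=-1)  # re-normalize after whitening
```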
Edit:
Figured out the 1536 dimension size. We need to pool across the token positions, so GeM gives a pooled embedding with the same dimension as the CLS token embedding, and concatenating the two yields 768 * 2 = 1536 regardless of the input image size.
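A quick check of the arithmetic (values for dino_vitb8 at 224×224):

```python
embed_dim   = 768              # ViT-B token dimension
num_patches = (224 // 8) ** 2  # 784 patch tokens for dino_vitb8 at 224x224
# GeM pools across the 784 token positions -> one 768-dim vector,
# concatenated with the 768-dim CLS token -> 1536-dim embedding.
assert embed_dim * 2 == 1536
```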
_Originally posted by @KeremTurgutlu in https://github.com/facebookresearch/dino/issues/8#issuecomment-833180355_
Top GitHub Comments
Hi @luoyaxiong,
Yes, I can try to do that in the following days (I don't have much bandwidth tbh). The code is very similar to `eval_knn.py`. Let me know if you have any specific questions in the meantime.
https://github.com/facebookresearch/dino/commit/ba9edd18db78a99193005ef991e04d63984b25a8