`interpolate_pos_encoding(x, pos_embed)` doesn't return the correct dimensions for images that are not square (w != h)

See original GitHub issue

I noticed that the generation of the positional embedding in the `interpolate_pos_encoding` method is slightly different from the one in the `forward_selfattention` method. The following simple modification brings both onto the same page, in case it is of interest:
```python
def interpolate_pos_encoding(self, x, pos_embed, w, h):  # pass w and h as arguments
    npatch = x.shape[1] - 1
    N = pos_embed.shape[1] - 1
    if npatch == N:
        return pos_embed
    class_emb = pos_embed[:, 0]
    pos_embed = pos_embed[:, 1:]
    dim = x.shape[-1]
    w0 = w // self.patch_embed.patch_size  # copied from forward_selfattention
    h0 = h // self.patch_embed.patch_size
    pos_embed = nn.functional.interpolate(
        pos_embed.reshape(1, int(math.sqrt(N)), int(math.sqrt(N)), dim).permute(0, 3, 1, 2),
        # per-axis scales, as in forward_selfattention, instead of math.sqrt(npatch / N)
        scale_factor=(w0 / math.sqrt(N), h0 / math.sqrt(N)),
        mode='bicubic',
    )
    pos_embed = pos_embed.permute(0, 2, 3, 1).view(1, -1, dim)
    return torch.cat((class_emb.unsqueeze(0), pos_embed), dim=1)
```
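To see why the per-axis scale factors matter, the arithmetic below sketches the grid sizes involved. The concrete numbers (a 480x640 input, 16-pixel patches, 224x224 pretraining resolution) are assumptions for illustration and do not come from the issue itself:

```python
import math

# Hypothetical numbers: a 480x640 (h x w) input, 16x16 patches, and a
# model pretrained at 224x224 (none of these come from the issue itself).
patch_size = 16
w, h = 640, 480

w0 = w // patch_size           # 40 patch tokens along the width
h0 = h // patch_size           # 30 patch tokens along the height
npatch = w0 * h0               # 1200 tokens for the non-square image

N = (224 // patch_size) ** 2   # 196 pretrained positions (a 14x14 grid)
side = int(math.sqrt(N))       # 14

# A single scale derived from sqrt(npatch / N) only makes sense if the
# token grid is square; here sqrt(npatch) is not even an integer, so no
# square grid with that side length exists:
assert math.sqrt(npatch) != int(math.sqrt(npatch))

# Scaling each axis independently (the proposed fix) recovers the
# intended 40x30 grid from the pretrained 14x14 one:
assert round(side * (w0 / math.sqrt(N))) == 40
assert round(side * (h0 / math.sqrt(N))) == 30
```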
Issue Analytics
- State: closed
- Created 2 years ago
- Comments: 11 (3 by maintainers)
Top GitHub Comments
Hi @KeremTurgutlu, let me open a new issue 😃
@enverfakhan I have incorporated your suggested fix for the floating point error and have also been trying to improve the forward logic in the vision_transformer.py code. Thanks a lot for your suggestion and feedback is appreciated if you do have some time 😃. https://github.com/facebookresearch/dino/blob/6687929d7cdc2e7a5150f6e24c2b6713293944ac/vision_transformer.py#L174-L233
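For context on the floating point error mentioned here: `nn.functional.interpolate` turns a fractional `scale_factor` into an output size via a floor, so a scale like `w0 / math.sqrt(N)` that rounds a hair below the exact ratio can yield an output grid one patch too small. The linked code appears to address this by nudging `w0` and `h0` by a small offset before dividing. A minimal stdlib-only sketch of that idea (the grid numbers are assumptions, not taken from the issue):

```python
import math

# Assumed setup: N = 196 pretrained positions, i.e. a 14x14 grid.
N = 196
side = int(math.sqrt(N))  # 14

def grid_after_interpolate(w0, eps=0.0):
    """Output width computed the way interpolate() does: floor(size * scale)."""
    scale = (w0 + eps) / math.sqrt(N)
    return math.floor(side * scale)

# With a small epsilon added to the target size, the floor always lands
# on the intended number of patches, regardless of rounding direction:
assert all(grid_after_interpolate(w0, eps=0.1) == w0 for w0 in range(1, 1000))
```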
I’m closing this issue. Feel free to reopen if there are other problems related to the interpolation of the positional encodings.
That’s slightly disappointing 😕. Have you tried the other models? For example, ViT-Base/16 should be more manageable memory-wise. As a matter of fact, on copy detection datasets I’ve found the base models to perform clearly better than the small ones: I get better performance with Base16x16 than with Small8x8, though Small8x8 is better at k-NN ImNet.
Yes your solution is definitely better ! I’ll update that in the code.