ViT support for resolutions beyond 224x224
When the input resolution changes, the size of `ViTModel`'s position embeddings also changes, so the `from_pretrained` method no longer works.
So, how can I use ViT with a different resolution like 64x64?
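A minimal sketch of the failure being described (the issue does not show the exact code; the checkpoint name and the config override are assumptions):

```python
from transformers import ViTConfig, ViTModel

# Override the input resolution while keeping the pretrained 16x16 patches.
config = ViTConfig.from_pretrained("google/vit-base-patch16-224", image_size=64)

# This fails with a shape mismatch: the checkpoint stores position embeddings
# for a 14x14 patch grid, i.e. (1, 197, 768), but a 64x64 model with 16x16
# patches expects (1, 17, 768).
model = ViTModel.from_pretrained("google/vit-base-patch16-224", config=config)
```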
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I think you first need to load the `state_dict` of the original model. Then, initialize a new `ViTModel` with a custom `image_size`, update the position embeddings of the `state_dict`, and load the new model with that `state_dict`.
One would need to interpolate the pre-trained position embeddings. You can see how this is done in the original implementation here.
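A minimal sketch of those steps, assuming the google/vit-base-patch16-224 checkpoint and a new `image_size` of 64 while keeping the original 16x16 patches (so only the position embeddings need to change):

```python
import torch
import torch.nn.functional as F
from transformers import ViTConfig, ViTModel

# 1) Load the original model and take its state_dict.
pretrained = ViTModel.from_pretrained("google/vit-base-patch16-224")
state_dict = pretrained.state_dict()

# 2) Initialize a new ViTModel with a custom image_size (64x64 with 16x16 patches,
#    i.e. a 4x4 patch grid and 4*4 + 1 = 17 positions instead of 197).
config = ViTConfig.from_pretrained("google/vit-base-patch16-224", image_size=64)
new_model = ViTModel(config)

# 3) Update the position embeddings in the state_dict: keep the [CLS] position
#    and resize the 14x14 patch grid down to 4x4 by interpolation.
pos = state_dict["embeddings.position_embeddings"]            # (1, 197, 768)
cls_pos, patch_pos = pos[:, :1], pos[:, 1:]
grid = patch_pos.reshape(1, 14, 14, -1).permute(0, 3, 1, 2)   # (1, 768, 14, 14)
grid = F.interpolate(grid, size=(4, 4), mode="bicubic", align_corners=False)
patch_pos = grid.permute(0, 2, 3, 1).reshape(1, 16, -1)       # (1, 16, 768)
state_dict["embeddings.position_embeddings"] = torch.cat([cls_pos, patch_pos], dim=1)

# 4) Load the new model with the updated state_dict.
new_model.load_state_dict(state_dict)
```

Keeping the 16x16 patch size means every other pretrained weight still has a matching shape, so only the position embeddings need to be touched; changing the patch size as well (as in the 64x64 / patch-8 example below) would additionally make the patch projection weights incompatible.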
You can find a PyTorch implementation of that here (taken from the T2T-ViT implementation), where they show how you can go from 224 to 384. The pre-trained position embeddings have shape (1, 197, 768): there are 196 patch “positions” in a 224x224 image with a patch size of 16x16, since (224/16)^2 = 196, plus 1 for the [CLS] token. Suppose you want to fine-tune at a resolution of 64x64 with a patch size of 8; then the number of position embeddings is (64/8)^2 + 1 = 65. In that case, the position embeddings during fine-tuning have shape (1, 65, 768), and you can use that function to map the pre-trained position embeddings from shape (1, 197, 768) to (1, 65, 768).
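A rough PyTorch sketch of that mapping (the function name `resize_pos_embed` and the bicubic mode are assumptions; the linked T2T-ViT code should be treated as the reference):

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed: torch.Tensor, new_grid_size: int) -> torch.Tensor:
    """Map ViT position embeddings to a new patch grid, e.g. (1, 197, 768) -> (1, 65, 768)."""
    # Keep the [CLS] position as-is; only the patch positions form a spatial grid.
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    old_grid_size = int(patch_pos.shape[1] ** 0.5)        # 14 for a 224x224 / patch-16 model
    dim = patch_pos.shape[-1]

    # (1, 196, 768) -> (1, 768, 14, 14) so 2D interpolation can be applied.
    patch_pos = patch_pos.reshape(1, old_grid_size, old_grid_size, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid_size, new_grid_size),
                              mode="bicubic", align_corners=False)
    # (1, 768, 8, 8) -> (1, 64, 768), then prepend the [CLS] position again.
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid_size ** 2, dim)
    return torch.cat([cls_pos, patch_pos], dim=1)

# 64x64 images with 8x8 patches -> an 8x8 grid, i.e. 65 positions including [CLS].
new_pos = resize_pos_embed(torch.randn(1, 197, 768), new_grid_size=64 // 8)
print(new_pos.shape)  # torch.Size([1, 65, 768])
```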