
ViT support for resolutions beyond 224x224

See original GitHub issue

When the resolution changes, the size of the position embeddings of ViTModel also changes, which breaks the from_pretrained method.

So, how can I use ViT with a different resolution like 64x64?
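
To make the failure concrete, here is a minimal reproduction sketch (the exact error text varies across transformers versions, so treat the message below as illustrative):

from transformers import ViTModel

# Loading 224x224 weights into a 64x64 model fails on the position embeddings:
# the checkpoint has shape (1, 197, 768), but a 64x64 model with 16x16 patches
# expects (64/16)^2 + 1 = 17 positions, i.e. shape (1, 17, 768).
model = ViTModel.from_pretrained('google/vit-base-patch16-224', image_size=64)
# RuntimeError: ... size mismatch for embeddings.position_embeddings ...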

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

2 reactions
NielsRogge commented, Jun 15, 2021

I think you first need to load the state_dict of the original model, like so:

from transformers import ViTModel

model = ViTModel.from_pretrained('google/vit-base-patch16-224') # load pretrained model
state_dict = model.state_dict()

Then, initialize a new ViTModel with a custom image_size, update the position embeddings in the state_dict, and load the new model with that state_dict:

from transformers import ViTConfig

config = ViTConfig.from_pretrained('google/vit-base-patch16-224', image_size=64)
# new model with custom image_size
model = ViTModel(config=config)

# update the state_dict
new_state_dict = state_dict.copy()
old_posemb = new_state_dict['embeddings.position_embeddings']
if model.embeddings.position_embeddings.shape != old_posemb.shape:
    # shapes differ, so the position embeddings need to be resized by interpolation
    new_posemb = resize_pos_embed(old_posemb, model.embeddings.position_embeddings)  # sketched below
    new_state_dict['embeddings.position_embeddings'] = new_posemb

# equip the new model with the updated state_dict
model.load_state_dict(new_state_dict)
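
For completeness, here is a minimal sketch of what such a resize_pos_embed helper can look like, modeled on the timm/T2T-ViT implementation the comment below refers to (the exact signature there differs slightly, so treat this as an assumption):

import math

import torch
import torch.nn.functional as F

def resize_pos_embed(posemb, posemb_new, num_tokens=1):
    # posemb: pretrained position embeddings, shape (1, old_len, dim)
    # posemb_new: target parameter (only its shape is used), shape (1, new_len, dim)
    # num_tokens: number of prepended special tokens ([CLS]) carried over unchanged
    ntok_new = posemb_new.shape[1] - num_tokens
    posemb_tok, posemb_grid = posemb[:, :num_tokens], posemb[0, num_tokens:]
    gs_old = int(math.sqrt(posemb_grid.shape[0]))  # old grid side, e.g. 14 for 224/16
    gs_new = int(math.sqrt(ntok_new))              # new grid side, e.g. 8 for 64/8
    # reshape the flat grid to 2D, interpolate bicubically, then flatten back
    posemb_grid = posemb_grid.reshape(1, gs_old, gs_old, -1).permute(0, 3, 1, 2)
    posemb_grid = F.interpolate(posemb_grid, size=(gs_new, gs_new), mode='bicubic', align_corners=False)
    posemb_grid = posemb_grid.permute(0, 2, 3, 1).reshape(1, gs_new * gs_new, -1)
    return torch.cat([posemb_tok, posemb_grid], dim=1)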
1 reaction
NielsRogge commented, Jun 15, 2021

One would need to interpolate the pre-trained position embeddings. You can see how this is done in the original implementation here.

You can find a PyTorch implementation of that here (taken from the T2T-ViT implementation), where they show how you can go from 224 to 384. The pre-trained position embeddings have shape (1, 197, 768): an image of 224x224 with a patch size of 16x16 yields (224/16)^2 = 196 “positions”, plus 1 for the [CLS] token. Suppose you want to fine-tune at a resolution of 64x64 with a patch size of 8; then the number of position embeddings is (64/8)^2 + 1 = 65. In that case, the position embeddings during fine-tuning have shape (1, 65, 768), and you can use that function to map the pre-trained position embeddings from shape (1, 197, 768) to (1, 65, 768).
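
As a quick sanity check of that arithmetic, a freshly initialized 64x64 model with a patch size of 8 does produce 65 tokens (a hypothetical shape check only; no pretrained weights are loaded here):

import torch
from transformers import ViTConfig, ViTModel

# shape check only - random weights, nothing pretrained is loaded
config = ViTConfig.from_pretrained('google/vit-base-patch16-224', image_size=64, patch_size=8)
model = ViTModel(config=config)

pixel_values = torch.randn(1, 3, 64, 64)  # dummy 64x64 RGB batch
outputs = model(pixel_values=pixel_values)
print(outputs.last_hidden_state.shape)  # torch.Size([1, 65, 768]): (64/8)^2 + 1 tokens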


Top Results From Across the Web

  • A Multi-Axis Approach for Vision Transformer and MLP Models ...
    The Vision Transformer (ViT) has created a new landscape of model designs for computer vision ... FLOPs performance scaling with 224x224 image resolution...
  • Scale Vision Transformers (ViT) Beyond Hugging Face
    Speed up state-of-the-art ViT models in Hugging Face up to 2300% (25x ... 2012 (1 million images, 1,000 classes) at resolution 224x224...
  • Transformers for Image Recognition at Scale | OpenReview
    The ViT-L/16 model takes 16x16 pixel patches as input. For example, we pretrain our models with 224x224 resolution input images, which...
  • EfficientViT: Enhanced Linear Attention for High-Resolution ...
    Despite the great success of ViT in the low-resolution ... We build our model to have around 400M MACs under a 224x224 input...
  • Vision Transformers (ViT) Explained - Pinecone
    A deep dive into the Vision Transformer (ViT) and practical implementation. ... If, instead, we split a 224x224 pixel image into 256 14x14...
