ViT support for resolutions beyond 224x224
When the input resolution changes, the size of `ViTModel`'s position embeddings also changes, so the `from_pretrained` method no longer works.
So, how can I use ViT with a different resolution like 64x64?
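A minimal sketch of the failure being described (the issue does not show the exact code; the checkpoint name and the config override are assumptions):

```python
from transformers import ViTConfig, ViTModel

# Override the input resolution while keeping the pretrained 16x16 patches.
config = ViTConfig.from_pretrained("google/vit-base-patch16-224", image_size=64)

# This fails with a shape mismatch: the checkpoint stores position embeddings
# for a 14x14 patch grid, i.e. (1, 197, 768), but a 64x64 model with 16x16
# patches expects (1, 17, 768).
model = ViTModel.from_pretrained("google/vit-base-patch16-224", config=config)
```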
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I think you first need to load the `state_dict` of the original model. Then, initialize a new `ViTModel` with a custom `image_size`, update the position embeddings of the `state_dict`, and load the new model with that `state_dict`.
One would need to interpolate the pre-trained position embeddings. You can see how this is done in the original implementation here.
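A minimal sketch of those steps, assuming the google/vit-base-patch16-224 checkpoint and a new `image_size` of 64 while keeping the original 16x16 patches (so only the position embeddings need to change):

```python
import torch
import torch.nn.functional as F
from transformers import ViTConfig, ViTModel

# 1) Load the original model and take its state_dict.
pretrained = ViTModel.from_pretrained("google/vit-base-patch16-224")
state_dict = pretrained.state_dict()

# 2) Initialize a new ViTModel with a custom image_size (64x64 with 16x16 patches,
#    i.e. a 4x4 patch grid and 4*4 + 1 = 17 positions instead of 197).
config = ViTConfig.from_pretrained("google/vit-base-patch16-224", image_size=64)
new_model = ViTModel(config)

# 3) Update the position embeddings in the state_dict: keep the [CLS] position
#    and resize the 14x14 patch grid down to 4x4 by interpolation.
pos = state_dict["embeddings.position_embeddings"]            # (1, 197, 768)
cls_pos, patch_pos = pos[:, :1], pos[:, 1:]
grid = patch_pos.reshape(1, 14, 14, -1).permute(0, 3, 1, 2)   # (1, 768, 14, 14)
grid = F.interpolate(grid, size=(4, 4), mode="bicubic", align_corners=False)
patch_pos = grid.permute(0, 2, 3, 1).reshape(1, 16, -1)       # (1, 16, 768)
state_dict["embeddings.position_embeddings"] = torch.cat([cls_pos, patch_pos], dim=1)

# 4) Load the new model with the updated state_dict.
new_model.load_state_dict(state_dict)
```

Keeping the 16x16 patch size means every other pretrained weight still has a matching shape, so only the position embeddings need to be touched; changing the patch size as well (as in the 64x64 / patch-8 example below) would additionally make the patch projection weights incompatible.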
You can find a PyTorch implementation of that here (taken from the T2T-ViT implementation), where they show how you can go from 224 to 384. The pre-trained position embeddings have shape (1, 197, 768): there are 196 patch “positions” in a 224x224 image with a patch size of 16x16, since (224/16)^2 = 196, plus 1 for the [CLS] token. Suppose you want to fine-tune at a resolution of 64x64 with a patch size of 8; then the number of position embeddings is (64/8)^2 + 1 = 65. In that case, the position embeddings during fine-tuning have shape (1, 65, 768), and you can use that function to map the pre-trained position embeddings from shape (1, 197, 768) to (1, 65, 768).
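A rough PyTorch sketch of that mapping (the function name `resize_pos_embed` and the bicubic mode are assumptions; the linked T2T-ViT code should be treated as the reference):

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed: torch.Tensor, new_grid_size: int) -> torch.Tensor:
    """Map ViT position embeddings to a new patch grid, e.g. (1, 197, 768) -> (1, 65, 768)."""
    # Keep the [CLS] position as-is; only the patch positions form a spatial grid.
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    old_grid_size = int(patch_pos.shape[1] ** 0.5)        # 14 for a 224x224 / patch-16 model
    dim = patch_pos.shape[-1]

    # (1, 196, 768) -> (1, 768, 14, 14) so 2D interpolation can be applied.
    patch_pos = patch_pos.reshape(1, old_grid_size, old_grid_size, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid_size, new_grid_size),
                              mode="bicubic", align_corners=False)
    # (1, 768, 8, 8) -> (1, 64, 768), then prepend the [CLS] position again.
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid_size ** 2, dim)
    return torch.cat([cls_pos, patch_pos], dim=1)

# 64x64 images with 8x8 patches -> an 8x8 grid, i.e. 65 positions including [CLS].
new_pos = resize_pos_embed(torch.randn(1, 197, 768), new_grid_size=64 // 8)
print(new_pos.shape)  # torch.Size([1, 65, 768])
```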