Discussion on training issues I have encountered
Thank you for the implementation of the paper. This is the first time I'm dealing with a transformer model; I tried to train this model on the Kinetics-700 dataset, and I just want to share some of the issues I have encountered:
The paper suggests that the model works better with pretrained weights. Since this is a direct extension of the image transformer, most of the Vision Transformer's weights should apply directly, but there are two places that differ:
- Positional encoding: we now have H x W x T positions instead of H x W, so I copied the same positional encoding to every frame, similar to how we inflate ImageNet weights for I3D, except without dividing by T. One alternative I'm considering is to use angular initialization to generate a 1 x T temporal positional encoding and add it to the H x W image positional encoding to form the H x W x T positional encoding (see the sketch after this list).
- We now perform two self-attentions per block instead of one, so there are twice as many weights for the qkv and output fully connected layers. For now, within each block I use the same pretrained weights for both the first and the second self-attention when I keep the same number of heads as the pretrained image model. Alternatively, in a different model I use half the number of heads, so the temporal attention and the spatial attention each take half of the head weights.
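As a concrete illustration of the positional-encoding option above, here is a minimal sketch. It assumes the usual ViT layout (a positional embedding of shape 1 x (1 + H*W) x D with the class token first); the function name and shapes are mine, not this repository's API.

```python
import torch

def inflate_pos_embed(pos_embed: torch.Tensor, num_frames: int) -> torch.Tensor:
    """Copy a ViT spatial positional embedding to every frame.

    pos_embed: (1, 1 + H*W, D), class-token embedding first (assumed layout).
    Returns:   (1, 1 + H*W*num_frames, D), the spatial part tiled frame by frame.
    """
    cls_pe, spatial_pe = pos_embed[:, :1], pos_embed[:, 1:]
    # Same spatial encoding for every frame, analogous to inflating 2D conv weights
    # for I3D, but without dividing by T since these are added, not convolved.
    spatial_pe = spatial_pe.repeat(1, num_frames, 1)
    return torch.cat([cls_pe, spatial_pe], dim=1)

# Usage with ViT-B/16 at 224x224: 14*14 = 196 patches, embedding dim 768, 8 frames.
pos_embed = torch.randn(1, 1 + 196, 768)
inflated = inflate_pos_embed(pos_embed, num_frames=8)   # (1, 1 + 196*8, 768)

# Alternative idea from above: a separate 1 x T temporal embedding (learned, or
# angular/sinusoidal-initialized) broadcast-added onto the spatial embedding.
temporal_pe = 0.02 * torch.randn(1, 8, 1, 768)
combined = (pos_embed[:, 1:].reshape(1, 1, 196, 768) + temporal_pe).reshape(1, 8 * 196, 768)
```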
Since it is the first time I'm dealing with a Transformer, I wanted to reproduce what the paper claims, so I started with the "original" basic Vision Transformer setup:
- 12 heads 12 blocks
- GELU instead of GEGLU
- embedding size 768
- Image size 224, divided into 16x16 patches
With this setup, we can only squeeze 4 videos (4x8x3x224x224) per V100 GPU for training, even with torch.amp. This means that if I run an experiment on a p3.8xlarge machine with 4 V100 GPUs (~$12/h on-demand), it would take 39 days to do 300 epochs. Of course it may not need the full 300 epochs, but intuitively, training with a batch size of 16 is usually not optimal.
So as an alternative, I tried a new model with 6 heads and 8 blocks. Now I can fit 16 videos per GPU, for a total batch size of 64. The model started to train smoothly, but the training error began increasing after 7-8 epochs. Training accuracy peaked around 55%, and I didn't even bother to run validation because it is clearly not working. The relevant configuration I used is listed below.
```yaml
DATA:
  NUM_FRAMES: 8
  SAMPLING_RATE: 16
  TRAIN_JITTER_SCALES: [256, 320]
  TRAIN_CROP_SIZE: 224
  # TEST_CROP_SIZE: 224 # use if TEST.NUM_SPATIAL_CROPS: 1
  TEST_CROP_SIZE: 224   # use if TEST.NUM_SPATIAL_CROPS: 3
  INPUT_CHANNEL_NUM: [3]
  DECODING_BACKEND: torchvision
  MEAN: [0.5, 0.5, 0.5]
  STD: [0.5, 0.5, 0.5]
  WEIGHT_DECAY: 0.0
SOLVER:
  BASE_LR: 0.1 # 1 machine
  BASE_LR_SCALE_NUM_SHARDS: True
  LR_POLICY: cosine
  MAX_EPOCH: 300
  WEIGHT_DECAY: 5e-5
  WARMUP_EPOCHS: 35.0
  WARMUP_START_LR: 0.01
  OPTIMIZING_METHOD: sgd
TRANSFORMER:
  TOKEN_DIM: 768
  PATCH_SIZE: 16
  DEPTH: 8
  HEADS: 6
  HEAD_DIM: 64
  FF_DROPOUT: 0.1
  ATTN_DROPOUT: 0.0
```
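To make the TRANSFORMER section above concrete, here is a simplified sketch of what one divided space-time attention block looks like with these hyperparameters (two self-attentions per block, then an MLP). It uses PyTorch's stock nn.MultiheadAttention and omits the class token, so it illustrates the idea rather than the repository's actual block implementation.

```python
import torch
import torch.nn as nn

class DividedSpaceTimeBlock(nn.Module):
    """Simplified sketch of one divided space-time attention block:
    temporal attention, then spatial attention, then an MLP,
    each wrapped in a pre-norm residual connection."""

    def __init__(self, dim=768, heads=6, mlp_ratio=4.0, dropout=0.1):
        super().__init__()
        self.norm_t = nn.LayerNorm(dim)
        self.attn_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim)
        self.attn_s = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_mlp = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)), nn.GELU(),
            nn.Dropout(dropout), nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x, T, N):
        # x: (B, T*N, D) patch tokens; the class token is omitted for brevity.
        B, _, D = x.shape
        # Temporal attention: attend over the T frames at each spatial location.
        xt = x.reshape(B, T, N, D).permute(0, 2, 1, 3).reshape(B * N, T, D)
        h = self.norm_t(xt)
        xt = xt + self.attn_t(h, h, h, need_weights=False)[0]
        x = xt.reshape(B, N, T, D).permute(0, 2, 1, 3).reshape(B, T * N, D)
        # Spatial attention: attend over the N patches within each frame.
        xs = x.reshape(B * T, N, D)
        h = self.norm_s(xs)
        xs = xs + self.attn_s(h, h, h, need_weights=False)[0]
        x = xs.reshape(B, T * N, D)
        # Feed-forward.
        return x + self.mlp(self.norm_mlp(x))

# Usage: 8 frames of 14x14 = 196 patches each, embedding dim 768.
block = DividedSpaceTimeBlock()
tokens = torch.randn(2, 8 * 196, 768)
out = block(tokens, T=8, N=196)   # (2, 1568, 768)
```

With DEPTH: 8, the full encoder is simply eight of these blocks stacked on top of the patch and positional embeddings.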
These are the issues I have encountered so far. I want to share them because hopefully some of you are actually working with video models and maybe we can have a discussion. I think the next thing I'll try is increasing the depth.
Regards
Top GitHub Comments
Thanks @zmy1116. I was able to load the pretrained ViT weights into TimeSformer with the following modifications:
- I replaced `to_patch_embedding` with the `PatchEmbed` class from timm and renamed it to `patch_embed`.
- I changed the class token to `self.cls_token = nn.Parameter(torch.randn(1, 1, dim))` (instead of `(1, dim)`) for compatibility with the ViT model weights.
- I used a regex mapping to translate the ViT weight names to the TimeSformer names.
- The pretrained model itself was obtained through the `timm` library.
- Finally, I also tried initializing the temporal attention submodule's weights to zeros, as recommended by the ViViT paper.
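The actual snippets for these steps are not reproduced here, so the following is only a rough sketch under my own assumptions: timm's `vit_base_patch16_224` as the checkpoint, a placeholder `build_timesformer()` constructor, and made-up module names (`spatial_attn`, `temporal_attn`) standing in for whatever the video model really uses.

```python
import re
import torch
import timm

# Grab a pretrained ViT-B/16 from timm (this model name exists in timm).
vit_state = timm.create_model("vit_base_patch16_224", pretrained=True).state_dict()

# Hypothetical regex rules mapping ViT parameter names onto the video model's
# names; the real mapping depends entirely on how the video model is written.
rename_rules = [
    (r"^blocks\.(\d+)\.attn\.",  r"blocks.\1.spatial_attn."),
    (r"^blocks\.(\d+)\.norm1\.", r"blocks.\1.spatial_norm."),
]

def rename(key: str) -> str:
    for pattern, repl in rename_rules:
        key = re.sub(pattern, repl, key)
    return key

mapped_state = {rename(k): v for k, v in vit_state.items()}

# Load non-strictly: temporal-attention parameters have no ViT counterpart.
video_model = build_timesformer()  # placeholder constructor, not a real API
missing, unexpected = video_model.load_state_dict(mapped_state, strict=False)

# Zero-init the temporal attention output projection (as suggested by ViViT),
# so the freshly added temporal path initially contributes nothing to the residual.
for blk in video_model.blocks:
    torch.nn.init.zeros_(blk.temporal_attn.proj.weight)
    torch.nn.init.zeros_(blk.temporal_attn.proj.bias)
```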
Note: I'm using an internal framework, so a full copy/paste of my code wouldn't make sense to anyone, but the description above covers everything I've tried so far. I still need to tweak and debug more, though. After 80 epochs I'm still only getting about 55% validation accuracy on Kinetics-400 (@Hanqer), compared to the 40% I was getting without pretrained ViT weights.
Also, FWIW I am able to overfit the training data quite easily (no surprise there) and reach nearly 100 percent training accuracy with enough epochs.
@Hanqer @Tonyfy
With multiple rounds of changes and testing, I am able to reproduce a similar (not better) result on Kinetics700_2020 with a video transformer.
I did the following:
Based on Google's paper https://arxiv.org/pdf/2103.15691.pdf, I implemented their Model 2; their Model 3 is exactly TimeSformer, while Model 2 splits spatial attention and temporal attention into two separate stages. They claim that doing this works better, and it is also a smaller model, so we can fit a batch of 64 on 8 V100s. The modification is very straightforward to make on top of this repo.
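For reference, the two-stage idea in that Model 2 can be sketched roughly as follows. This uses PyTorch's stock transformer encoder layers and mean pooling instead of the paper's class tokens, and the depth split (8 spatial / 4 temporal layers) is my own assumption, so treat it as an outline of the factorised encoder rather than a faithful ViViT implementation.

```python
import torch
import torch.nn as nn

class FactorisedEncoder(nn.Module):
    """Sketch of the two-stage idea: a spatial encoder applied per frame,
    then a temporal encoder over the resulting per-frame representations."""

    def __init__(self, dim=768, heads=6, spatial_depth=8, temporal_depth=4):
        super().__init__()

        def make_layer():
            return nn.TransformerEncoderLayer(
                d_model=dim, nhead=heads, dim_feedforward=4 * dim,
                activation="gelu", batch_first=True, norm_first=True)

        self.spatial = nn.TransformerEncoder(make_layer(), num_layers=spatial_depth)
        self.temporal = nn.TransformerEncoder(make_layer(), num_layers=temporal_depth)

    def forward(self, x):
        # x: (B, T, N, D) patch tokens per frame.
        B, T, N, D = x.shape
        # Stage 1: spatial attention within each frame.
        x = self.spatial(x.reshape(B * T, N, D))
        # Pool each frame into a single token (mean pooling here; the paper uses
        # class tokens, which are skipped to keep the sketch short).
        frame_tokens = x.mean(dim=1).reshape(B, T, D)
        # Stage 2: temporal attention across the T frame tokens.
        return self.temporal(frame_tokens).mean(dim=1)   # (B, D) clip embedding

# Usage: batch of 2 clips, 8 frames, 196 patches, dim 768.
clips = torch.randn(2, 8, 196, 768)
features = FactorisedEncoder()(clips)   # (2, 768)
```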
It is important to find a good learning rate, and I have to use learning-rate warmup, otherwise training diverges right from the beginning; for Kinetics I use a warmup schedule along these lines.
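A minimal sketch of a linear warmup into cosine decay is below; the default numbers are taken from the configuration earlier in this thread purely as placeholders, not necessarily the values used in this run.

```python
import math

def lr_at_epoch(epoch, base_lr=0.1, warmup_start_lr=0.01,
                warmup_epochs=35.0, max_epoch=300):
    """Linear warmup into cosine decay; defaults mirror the config above."""
    if epoch < warmup_epochs:
        # Ramp linearly from warmup_start_lr up to base_lr.
        alpha = epoch / warmup_epochs
        return warmup_start_lr + alpha * (base_lr - warmup_start_lr)
    # Cosine decay from base_lr down to 0 over the remaining epochs.
    progress = (epoch - warmup_epochs) / (max_epoch - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# Example: epoch 0 -> 0.01, epoch 35 -> 0.1, epoch 300 -> 0.0.
```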
After 30 epochs I'm getting ~62% accuracy on Kinetics700_2020 with multiple views. My best model on this dataset (an X3D-M) was ~63.4% with multiple views. I don't think this is great, but I don't see any results for this dataset online; the closest public model I could find is from SenseTime's lab on K700, and they get 64% with multiple views.
So I would say that with a video transformer I can get a reasonable model, and the training time is around 30 hours on an 8-GPU machine, which I find very interesting.