Discussion on training issues I have encountered
Thank you for the implementation of the paper. This is the first time I'm dealing with a transformer model; I tried to train this model on the Kinetics-700 dataset, and I just want to share some of the issues I have encountered:
The paper suggests that the model works better with pretrained weights. Since this is a direct extension of the image transformer, most of the Vision Transformer's weights should apply directly, but there are two places that differ:
- Positional encoding: we now have H x W x T positions instead of H x W, so I copied the same positional encoding to every frame, similar to how we inflate ImageNet weights for I3D, except without dividing by T. One alternative I'm considering is to use angular initialization to generate a 1 x T temporal positional encoding and add it to the H x W image positional encoding to form the H x W x T positional encoding (see the sketch after this list).
- We now perform two self-attentions per block instead of one, so there are twice as many weights for the qkv and output fully connected layers. For now, within each block I use the same pretrained weights for both the first and the second self-attention when I keep the same number of heads as the pretrained image model. Alternatively, in a different model I use half the number of heads, so the temporal attention and the spatial attention each take half of the head weights.
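As a concrete illustration of the positional-encoding option above, here is a minimal sketch. It assumes the usual ViT layout (a positional embedding of shape 1 x (1 + H*W) x D with the class token first); the function name and shapes are mine, not this repository's API.

```python
import torch

def inflate_pos_embed(pos_embed: torch.Tensor, num_frames: int) -> torch.Tensor:
    """Copy a ViT spatial positional embedding to every frame.

    pos_embed: (1, 1 + H*W, D), class-token embedding first (assumed layout).
    Returns:   (1, 1 + H*W*num_frames, D), the spatial part tiled frame by frame.
    """
    cls_pe, spatial_pe = pos_embed[:, :1], pos_embed[:, 1:]
    # Same spatial encoding for every frame, analogous to inflating 2D conv weights
    # for I3D, but without dividing by T since these are added, not convolved.
    spatial_pe = spatial_pe.repeat(1, num_frames, 1)
    return torch.cat([cls_pe, spatial_pe], dim=1)

# Usage with ViT-B/16 at 224x224: 14*14 = 196 patches, embedding dim 768, 8 frames.
pos_embed = torch.randn(1, 1 + 196, 768)
inflated = inflate_pos_embed(pos_embed, num_frames=8)   # (1, 1 + 196*8, 768)

# Alternative idea from above: a separate 1 x T temporal embedding (learned, or
# angular/sinusoidal-initialized) broadcast-added onto the spatial embedding.
temporal_pe = 0.02 * torch.randn(1, 8, 1, 768)
combined = (pos_embed[:, 1:].reshape(1, 1, 196, 768) + temporal_pe).reshape(1, 8 * 196, 768)
```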
Since it is the first time I'm dealing with a Transformer, I wanted to reproduce what the paper claims, so I started with the "original" basic Vision Transformer setup:
- 12 heads 12 blocks
- GELU instead of GEGLU
- embedding size 768
- Image size 224, divided into 16x16 patches
With this setup, we can only squeeze 4 videos (4x8x3x224x224) per V100 GPU for training, even with torch.amp. This means that if I run an experiment on a p3.8xlarge machine with 4 V100 GPUs (~$12/h on-demand), it would take 39 days to do 300 epochs. Of course it may not need the full 300 epochs, but intuitively, training with a batch size of 16 is usually not optimal.
So as an alternative, I tried a new model with 6 heads and 8 blocks. Now I can fit 16 videos per GPU, for a total batch size of 64. The model started to train smoothly, but the training error began increasing after 7-8 epochs. Training accuracy peaked around 55%, and I didn't even bother to run validation because it is clearly not working. The relevant configuration I used is listed below.
```yaml
DATA:
  NUM_FRAMES: 8
  SAMPLING_RATE: 16
  TRAIN_JITTER_SCALES: [256, 320]
  TRAIN_CROP_SIZE: 224
  # TEST_CROP_SIZE: 224 # use if TEST.NUM_SPATIAL_CROPS: 1
  TEST_CROP_SIZE: 224   # use if TEST.NUM_SPATIAL_CROPS: 3
  INPUT_CHANNEL_NUM: [3]
  DECODING_BACKEND: torchvision
  MEAN: [0.5, 0.5, 0.5]
  STD: [0.5, 0.5, 0.5]
  WEIGHT_DECAY: 0.0
SOLVER:
  BASE_LR: 0.1 # 1 machine
  BASE_LR_SCALE_NUM_SHARDS: True
  LR_POLICY: cosine
  MAX_EPOCH: 300
  WEIGHT_DECAY: 5e-5
  WARMUP_EPOCHS: 35.0
  WARMUP_START_LR: 0.01
  OPTIMIZING_METHOD: sgd
TRANSFORMER:
  TOKEN_DIM: 768
  PATCH_SIZE: 16
  DEPTH: 8
  HEADS: 6
  HEAD_DIM: 64
  FF_DROPOUT: 0.1
  ATTN_DROPOUT: 0.0
```
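To make the TRANSFORMER section above concrete, here is a simplified sketch of what one divided space-time attention block looks like with these hyperparameters (two self-attentions per block, then an MLP). It uses PyTorch's stock nn.MultiheadAttention and omits the class token, so it illustrates the idea rather than the repository's actual block implementation.

```python
import torch
import torch.nn as nn

class DividedSpaceTimeBlock(nn.Module):
    """Simplified sketch of one divided space-time attention block:
    temporal attention, then spatial attention, then an MLP,
    each wrapped in a pre-norm residual connection."""

    def __init__(self, dim=768, heads=6, mlp_ratio=4.0, dropout=0.1):
        super().__init__()
        self.norm_t = nn.LayerNorm(dim)
        self.attn_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim)
        self.attn_s = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_mlp = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)), nn.GELU(),
            nn.Dropout(dropout), nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x, T, N):
        # x: (B, T*N, D) patch tokens; the class token is omitted for brevity.
        B, _, D = x.shape
        # Temporal attention: attend over the T frames at each spatial location.
        xt = x.reshape(B, T, N, D).permute(0, 2, 1, 3).reshape(B * N, T, D)
        h = self.norm_t(xt)
        xt = xt + self.attn_t(h, h, h, need_weights=False)[0]
        x = xt.reshape(B, N, T, D).permute(0, 2, 1, 3).reshape(B, T * N, D)
        # Spatial attention: attend over the N patches within each frame.
        xs = x.reshape(B * T, N, D)
        h = self.norm_s(xs)
        xs = xs + self.attn_s(h, h, h, need_weights=False)[0]
        x = xs.reshape(B, T * N, D)
        # Feed-forward.
        return x + self.mlp(self.norm_mlp(x))

# Usage: 8 frames of 14x14 = 196 patches each, embedding dim 768.
block = DividedSpaceTimeBlock()
tokens = torch.randn(2, 8 * 196, 768)
out = block(tokens, T=8, N=196)   # (2, 1568, 768)
```

With DEPTH: 8, the full encoder is simply eight of these blocks stacked on top of the patch and positional embeddings.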
These are the issues I have encountered so far. I want to share them because hopefully some of you are actually working with video models and maybe we can have a discussion. I think the next thing I'll try is increasing the depth.
Regards
Top GitHub Comments
Thanks @zmy1116. I was able to load the pretrained ViT weights into TimeSformer with the following modifications:
- I replaced `to_patch_embedding` with the `PatchEmbed` class from timm and renamed it to `patch_embed`.
- I changed the class token to `self.cls_token = nn.Parameter(torch.randn(1, 1, dim))` (instead of `(1, dim)`) for compatibility with the ViT model weights.
- I used a regex mapping to translate the ViT weight names to the TimeSformer names.
- The pretrained model itself was obtained through the `timm` library.
- Finally, I also tried initializing the temporal attention submodule's weights to zeros, as recommended by the ViViT paper.
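The actual snippets for these steps are not reproduced here, so the following is only a rough sketch under my own assumptions: timm's `vit_base_patch16_224` as the checkpoint, a placeholder `build_timesformer()` constructor, and made-up module names (`spatial_attn`, `temporal_attn`) standing in for whatever the video model really uses.

```python
import re
import torch
import timm

# Grab a pretrained ViT-B/16 from timm (this model name exists in timm).
vit_state = timm.create_model("vit_base_patch16_224", pretrained=True).state_dict()

# Hypothetical regex rules mapping ViT parameter names onto the video model's
# names; the real mapping depends entirely on how the video model is written.
rename_rules = [
    (r"^blocks\.(\d+)\.attn\.",  r"blocks.\1.spatial_attn."),
    (r"^blocks\.(\d+)\.norm1\.", r"blocks.\1.spatial_norm."),
]

def rename(key: str) -> str:
    for pattern, repl in rename_rules:
        key = re.sub(pattern, repl, key)
    return key

mapped_state = {rename(k): v for k, v in vit_state.items()}

# Load non-strictly: temporal-attention parameters have no ViT counterpart.
video_model = build_timesformer()  # placeholder constructor, not a real API
missing, unexpected = video_model.load_state_dict(mapped_state, strict=False)

# Zero-init the temporal attention output projection (as suggested by ViViT),
# so the freshly added temporal path initially contributes nothing to the residual.
for blk in video_model.blocks:
    torch.nn.init.zeros_(blk.temporal_attn.proj.weight)
    torch.nn.init.zeros_(blk.temporal_attn.proj.bias)
```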
Note: I'm using an internal framework, so a full copy/paste of my code wouldn't make sense to anyone, but the description above covers everything I've tried so far. I still need to tweak and debug more, though. After 80 epochs I'm still only getting about 55% validation accuracy on Kinetics-400 (@Hanqer), compared to the 40% I was getting without pretrained ViT weights.
Also, FWIW I am able to overfit the training data quite easily (no surprise there) and reach nearly 100 percent training accuracy with enough epochs.
@Hanqer @Tonyfy
With multiple rounds of changes and testing, I am able to reproduce a similar (not better) result on Kinetics700_2020 with a video transformer.
I did the following:
Based on Google's paper https://arxiv.org/pdf/2103.15691.pdf, I implemented their Model 2; their Model 3 is exactly TimeSformer, while Model 2 splits spatial attention and temporal attention into two separate stages. They claim that doing this works better, and it is also a smaller model, so we can fit a batch of 64 on 8 V100s. The modification is very straightforward to make on top of this repo.
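For reference, the two-stage idea in that Model 2 can be sketched roughly as follows. This uses PyTorch's stock transformer encoder layers and mean pooling instead of the paper's class tokens, and the depth split (8 spatial / 4 temporal layers) is my own assumption, so treat it as an outline of the factorised encoder rather than a faithful ViViT implementation.

```python
import torch
import torch.nn as nn

class FactorisedEncoder(nn.Module):
    """Sketch of the two-stage idea: a spatial encoder applied per frame,
    then a temporal encoder over the resulting per-frame representations."""

    def __init__(self, dim=768, heads=6, spatial_depth=8, temporal_depth=4):
        super().__init__()

        def make_layer():
            return nn.TransformerEncoderLayer(
                d_model=dim, nhead=heads, dim_feedforward=4 * dim,
                activation="gelu", batch_first=True, norm_first=True)

        self.spatial = nn.TransformerEncoder(make_layer(), num_layers=spatial_depth)
        self.temporal = nn.TransformerEncoder(make_layer(), num_layers=temporal_depth)

    def forward(self, x):
        # x: (B, T, N, D) patch tokens per frame.
        B, T, N, D = x.shape
        # Stage 1: spatial attention within each frame.
        x = self.spatial(x.reshape(B * T, N, D))
        # Pool each frame into a single token (mean pooling here; the paper uses
        # class tokens, which are skipped to keep the sketch short).
        frame_tokens = x.mean(dim=1).reshape(B, T, D)
        # Stage 2: temporal attention across the T frame tokens.
        return self.temporal(frame_tokens).mean(dim=1)   # (B, D) clip embedding

# Usage: batch of 2 clips, 8 frames, 196 patches, dim 768.
clips = torch.randn(2, 8, 196, 768)
features = FactorisedEncoder()(clips)   # (2, 768)
```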
It is important to find a good learning rate, and I have to use learning-rate warmup, otherwise training diverges right from the beginning; for Kinetics I use a warmup schedule along these lines.
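A minimal sketch of a linear warmup into cosine decay is below; the default numbers are taken from the configuration earlier in this thread purely as placeholders, not necessarily the values used in this run.

```python
import math

def lr_at_epoch(epoch, base_lr=0.1, warmup_start_lr=0.01,
                warmup_epochs=35.0, max_epoch=300):
    """Linear warmup into cosine decay; defaults mirror the config above."""
    if epoch < warmup_epochs:
        # Ramp linearly from warmup_start_lr up to base_lr.
        alpha = epoch / warmup_epochs
        return warmup_start_lr + alpha * (base_lr - warmup_start_lr)
    # Cosine decay from base_lr down to 0 over the remaining epochs.
    progress = (epoch - warmup_epochs) / (max_epoch - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# Example: epoch 0 -> 0.01, epoch 35 -> 0.1, epoch 300 -> 0.0.
```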
After 30 epochs I'm getting ~62% accuracy on Kinetics700_2020 with multiple views. My best model on this dataset (an X3D-M) was ~63.4% with multiple views. I don't think this is great, but I don't see any results for this dataset online; the closest public model I could find is from SenseTime's lab on K700, and they get 64% with multiple views.
So I would say that with a video transformer I can get a reasonable model, and the training time is around 30 hours on an 8-GPU machine, which I find very interesting.