
Hi, thanks for providing this wonderful repository! I’m trying to reproduce the results on the AVA dataset with the MViT model, but I have only achieved ~20 mAP so far. I built the config file from the implementation details reported in the paper, changed the head of the MViT model to head_helper.ResNetRoIHead, and loaded the weights from the provided Kinetics checkpoint.

Should I be able to reproduce the results from the paper that way?

Thanks, Elad.
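
For reference, the head swap described above implies that the Kinetics classification-head weights cannot be reused, so a common step when loading the checkpoint is to drop them and initialize the new RoI head from scratch. The sketch below illustrates that filtering in plain Python; the key names (a "head." prefix) are an assumption for illustration — the real code would operate on a torch state dict, and the actual prefix should be confirmed by inspecting the checkpoint.

```python
# Minimal sketch: drop the Kinetics classification head before fine-tuning
# on AVA, since the new head (e.g. a RoI head with 80 classes) has a
# different shape. The "head." prefix is illustrative, not confirmed
# against the actual PySlowFast checkpoint layout.

def strip_head(state_dict, head_prefix="head."):
    """Return a copy of the state dict without classification-head weights."""
    return {k: v for k, v in state_dict.items() if not k.startswith(head_prefix)}

# Toy stand-in for a checkpoint's state dict (values would be tensors).
ckpt = {
    "patch_embed.proj.weight": "...",
    "blocks.0.attn.qkv.weight": "...",
    "head.projection.weight": "...",   # Kinetics-400 head: 400 classes
    "head.projection.bias": "...",
}

backbone_only = strip_head(ckpt)
print(sorted(backbone_only))  # only backbone keys remain
```

With the head weights removed, the remaining backbone weights can be loaded non-strictly, leaving the new detection head randomly initialized.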

The config file:

TRAIN:
  ENABLE: True
  DATASET: ava
  BATCH_SIZE: 64
  EVAL_PERIOD: 5
  CHECKPOINT_PERIOD: 1
  AUTO_RESUME: True
  CHECKPOINT_FILE_PATH: CPS/Kinetics400/K400_MVIT_B_16x4_CONV.pyth
  CHECKPOINT_TYPE: pytorch
  CHECKPOINT_EPOCH_RESET: True
DATA:
  NUM_FRAMES: 16
  SAMPLING_RATE: 4
  TRAIN_JITTER_SCALES: [256, 320]
  TRAIN_CROP_SIZE: 224
  TEST_CROP_SIZE: 224
  INPUT_CHANNEL_NUM: [3]
  TRAIN_JITTER_SCALES_RELATIVE: [0.08, 1.0]
  TRAIN_JITTER_ASPECT_RELATIVE: [0.75, 1.3333]
DETECTION:
  ENABLE: True
  ALIGNED: True
AVA:
  DETECTION_SCORE_THRESH: 0.8
  TRAIN_PREDICT_BOX_LISTS: [
    "ava_train_v2.2.csv",
    "person_box_67091280_iou90/ava_detection_train_boxes_and_labels_include_negative_v2.2.csv",
  ]
  TEST_PREDICT_BOX_LISTS: ["person_box_67091280_iou90/ava_detection_val_boxes_and_labels.csv"]
  BGR: False
MVIT:
  ZERO_DECAY_POS_CLS: False
  SEP_POS_EMBED: True
  DEPTH: 16
  NUM_HEADS: 1
  EMBED_DIM: 96
  PATCH_KERNEL: (3, 7, 7)
  PATCH_STRIDE: (2, 4, 4)
  PATCH_PADDING: (1, 3, 3)
  MLP_RATIO: 4.0
  QKV_BIAS: True
  DROPPATH_RATE: 0.4
  NORM: "layernorm"
  MODE: "conv"
  CLS_EMBED_ON: False
  DIM_MUL: [[1, 2.0], [3, 2.0], [14, 2.0]]
  HEAD_MUL: [[1, 2.0], [3, 2.0], [14, 2.0]]
  POOL_KVQ_KERNEL: [3, 3, 3]
  POOL_KV_STRIDE_ADAPTIVE: [1, 8, 8]
  POOL_Q_STRIDE: [[1, 1, 2, 2], [3, 1, 2, 2], [14, 1, 2, 2]]
  DROPOUT_RATE: 0.0
AUG:
  NUM_SAMPLE: 2
  ENABLE: True
  COLOR_JITTER: 0.4
  AA_TYPE: rand-m7-n4-mstd0.5-inc1
  INTERPOLATION: bicubic
  RE_PROB: 0.25
  RE_MODE: pixel
  RE_COUNT: 1
  RE_SPLIT: False
MIXUP:
  ENABLE: False
  ALPHA: 0.8
  CUTMIX_ALPHA: 1.0
  PROB: 1.0
  SWITCH_PROB: 0.5
  LABEL_SMOOTH_VALUE: 0.1
BN:
  USE_PRECISE_STATS: False
  NUM_BATCHES_PRECISE: 200
SOLVER:
  ZERO_WD_1D_PARAM: True
  CLIP_GRAD_L2NORM: 1.0
  BASE_LR_SCALE_NUM_SHARDS: True
  BASE_LR: 0.6
  COSINE_END_LR: 1e-6
  WARMUP_START_LR: 1e-6
  WARMUP_EPOCHS: 5.0
  LR_POLICY: cosine
  MAX_EPOCH: 30
  MOMENTUM: 0.9
  WEIGHT_DECAY: 1e-8
  OPTIMIZING_METHOD: sgd
  COSINE_AFTER_WARMUP: True
MODEL:
  NUM_CLASSES: 80
  ARCH: mvit
  MODEL_NAME: MViT
  LOSS_FUNC: bce
  DROPOUT_RATE: 0.5
TEST:
  ENABLE: True
  DATASET: ava
  BATCH_SIZE: 8
  NUM_SPATIAL_CROPS: 1
DATA_LOADER:
  NUM_WORKERS: 8
  PIN_MEMORY: True
NUM_GPUS: 8
NUM_SHARDS: 1
RNG_SEED: 0
OUTPUT_DIR: .
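
As a sanity check on the SOLVER section above, here is a small sketch of the schedule those fields describe: linear warmup from WARMUP_START_LR over 5 epochs, then cosine decay from BASE_LR down to COSINE_END_LR by epoch 30 (COSINE_AFTER_WARMUP). This is my reading of the config, not PySlowFast's exact implementation; the library's per-iteration interpolation may differ slightly.

```python
import math

# Values taken from the SOLVER section of the config above.
BASE_LR = 0.6
COSINE_END_LR = 1e-6
WARMUP_START_LR = 1e-6
WARMUP_EPOCHS = 5.0
MAX_EPOCH = 30

def lr_at(epoch):
    """Learning rate at a (possibly fractional) epoch under warmup + cosine."""
    if epoch < WARMUP_EPOCHS:
        # Linear warmup from WARMUP_START_LR toward BASE_LR.
        alpha = epoch / WARMUP_EPOCHS
        return WARMUP_START_LR + alpha * (BASE_LR - WARMUP_START_LR)
    # Cosine decay over the remaining epochs (COSINE_AFTER_WARMUP: True).
    t = (epoch - WARMUP_EPOCHS) / (MAX_EPOCH - WARMUP_EPOCHS)
    return COSINE_END_LR + 0.5 * (BASE_LR - COSINE_END_LR) * (1 + math.cos(math.pi * t))

print(f"epoch  0: {lr_at(0):.2e}")   # ~WARMUP_START_LR
print(f"epoch  5: {lr_at(5):.2e}")   # ~BASE_LR (warmup done)
print(f"epoch 30: {lr_at(30):.2e}")  # ~COSINE_END_LR
```

Note also that with BASE_LR_SCALE_NUM_SHARDS: True, the base LR of 0.6 would be scaled further when training across multiple shards; with NUM_SHARDS: 1 as above it is unchanged.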

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 6 (1 by maintainers)

Top GitHub Comments

4 reactions
feichtenhofer commented, Aug 5, 2021

This will be released very soon.

0 reactions
yuanliangzhe commented, Dec 16, 2022

Hello, may I ask whether you were able to reproduce the paper’s reported results on AVA @eladb3? I still cannot find the corresponding training config or checkpoint in the repo @feichtenhofer.
