
An error during finetuning for the TVR task

See original GitHub issue

@linjieli222 Hi, I just encountered an error in Quick Start Step 3 when using 1 GPU:

# inside the container
CUDA_VISIBLE_DEVICES=0
horovodrun -np 1 python train_vcmr.py --config config/train-tvr-8gpu.json
...
...
[1,0]<stderr>:12/13/2021 09:08:05 - INFO - model.model -        Decoder Transformer config: None
[1,0]<stderr>:12/13/2021 09:08:08 - INFO - model.modeling_utils -   Weights of HeroForVcmr not initialized from pretrained model: ['v_encoder.fom_output.linear_1.weight', 'v_encoder.fom_output.linear_1.bias', 'v_encoder.fom_output.LayerNorm.weight', 'v_encoder.fom_output.LayerNorm.bias', 'v_encoder.fom_output.linear_2.weight', 'v_encoder.fom_output.linear_2.bias']
[1,0]<stderr>:12/13/2021 09:08:08 - INFO - model.modeling_utils -   Weights from pretrained model not used in HeroForVcmr: ['vocab_padded', 'v_encoder.fr_output.linear_1.weight', 'v_encoder.fr_output.linear_1.bias', 'v_encoder.fr_output.LayerNorm.weight', 'v_encoder.fr_output.LayerNorm.bias', 'v_encoder.fr_output.linear_2.weight', 'v_encoder.fr_output.linear_2.bias', 'v_encoder.itm_clip_transform.linear_1.weight', 'v_encoder.itm_clip_transform.linear_1.bias', 'v_encoder.itm_clip_transform.LayerNorm.weight', 'v_encoder.itm_clip_transform.LayerNorm.bias', 'v_encoder.itm_clip_transform.linear_2.weight', 'v_encoder.itm_clip_transform.linear_2.bias', 'v_encoder.itm_sub_transform.linear_1.weight', 'v_encoder.itm_sub_transform.linear_1.bias', 'v_encoder.itm_sub_transform.LayerNorm.weight', 'v_encoder.itm_sub_transform.LayerNorm.bias', 'v_encoder.itm_sub_transform.linear_2.weight', 'v_encoder.itm_sub_transform.linear_2.bias']
[1,0]<stdout>:Selected optimization level O2:  FP16 training with FP32 batchnorm and FP32 master weights.
[1,0]<stdout>:
[1,0]<stdout>:Defaults for this optimization level are:
[1,0]<stdout>:enabled                : True
[1,0]<stdout>:opt_level              : O2
[1,0]<stdout>:cast_model_type        : torch.float16
[1,0]<stdout>:patch_torch_functions  : False
[1,0]<stdout>:keep_batchnorm_fp32    : True
[1,0]<stdout>:master_weights         : True
[1,0]<stdout>:loss_scale             : dynamic
[1,0]<stdout>:Processing user overrides (additional kwargs that are not None)...
[1,0]<stdout>:After processing overrides, optimization options are:
[1,0]<stdout>:enabled                : True
[1,0]<stdout>:opt_level              : O2
[1,0]<stdout>:cast_model_type        : torch.float16
[1,0]<stdout>:patch_torch_functions  : False
[1,0]<stdout>:keep_batchnorm_fp32    : True
[1,0]<stdout>:master_weights         : True
[1,0]<stdout>:loss_scale             : dynamic
[1,0]<stderr>:Traceback (most recent call last):
[1,0]<stderr>:  File "train_vcmr.py", line 399, in <module>
[1,0]<stderr>:    main(args)
[1,0]<stderr>:  File "train_vcmr.py", line 161, in main
[1,0]<stderr>:    restorer = TrainingRestorer(opts, model, optimizer)
[1,0]<stderr>:  File "/src/utils/save.py", line 141, in __init__
[1,0]<stderr>:    assert vars(opts) == restore_opts
[1,0]<stderr>:AssertionError
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[30056,1],0]
  Exit code:    1
--------------------------------------------------------------------------

It seems to be caused by a mismatch between vars(opts) and restore_opts:

vars(opts)= {'model_config': 'config/hero_finetune.json', 'checkpoint': '/pretrain/hero-tv-ht100.pt', 'train_batch_size': 32, 'val_batch_size': 20, 'gradient_accumulation_steps': 2, 'learning_rate': 0.0001, 'valid_steps': 200, 'save_steps': 200, 'optim': 'adamw', 'betas': [
        0.9,
        0.98
    ], 'dropout': 0.1, 'weight_decay': 0.01, 'grad_norm': 1.0, 'warmup_steps': 500, 'lr_mul': 1.0, 'num_train_steps': 5000, 'output_dir': '/storage/tvr_default', 'sub_ctx_len': 0, 'max_clip_len': 100, 'max_txt_len': 60, 'vfeat_version': 'resnet_slowfast', 'vfeat_interval': 1.5, 'compressed_db': False, 'seed': 77, 'n_workers': 4, 'pin_mem': True, 'fp16': True, 'task': 'tvr', 'vcmr_eval_video_batch_size': 50, 'vcmr_eval_q_batch_size': 80, 'drop_svmr_prob': 0.8, 'lw_neg_q': 8.0, 'lw_neg_ctx': 8.0, 'lw_st_ed': 0.01, 'ranking_loss_type': 'hinge', 'margin': 0.1, 'hard_pool_size': [
        20
    ], 'hard_neg_weights': [
        10
    ], 'hard_negtiave_start_step': [
        2000
    ], 'train_span_start_step': 0, 'use_all_neg': True, 'eval_with_query_type': True, 'max_before_nms': 200, 'max_after_nms': 100, 'distributed_eval': True, 'nms_thd': 0.5, 'q2c_alpha': 20, 'max_vcmr_video': 100, 'full_eval_tasks': ['VCMR', 'SVMR', 'VR'
    ], 'min_pred_l': 2, 'max_pred_l': 16, 'sub_txt_db': '/txt/tv_subtitles.db', 'vfeat_db': '/video/tv', 'train_query_txt_db': '/txt/tvr_train.db', 'val_query_txt_db': '/txt/tvr_val.db', 'test_query_txt_db': None, 'vcmr_eval_batch_size': 80, 'rank': 0, 'n_gpu': 1
}

restore_opts= {'model_config': 'config/hero.json', 'checkpoint': '/pretrain/hero-tv-ht100.pt', 'train_batch_size': 32, 'val_batch_size': 20, 'gradient_accumulation_steps': 2, 'learning_rate': 0.0001, 'valid_steps': 200, 'save_steps': 200, 'optim': 'adamw', 'betas': [
        0.9,
        0.98
    ], 'dropout': 0.1, 'weight_decay': 0.01, 'grad_norm': 1.0, 'warmup_steps': 500, 'lr_mul': 1.0, 'num_train_steps': 5000, 'output_dir': '/storage/linjie_saved_results/release_debug/tvr_default', 'sub_ctx_len': 0, 'max_clip_len': 100, 'max_txt_len': 60, 'vfeat_version': 'resnet_slowfast', 'vfeat_interval': 1.5, 'compressed_db': False, 'seed': 77, 'n_workers': 4, 'pin_mem': True, 'fp16': True, 'task': 'tvr', 'vcmr_eval_video_batch_size': 50, 'vcmr_eval_q_batch_size': 80, 'drop_svmr_prob': 0.8, 'lw_neg_q': 8.0, 'lw_neg_ctx': 8.0, 'lw_st_ed': 0.01, 'ranking_loss_type': 'hinge', 'margin': 0.1, 'hard_pool_size': [
        20
    ], 'hard_neg_weights': [
        10
    ], 'hard_negtiave_start_step': [
        2000
    ], 'train_span_start_step': 0, 'use_all_neg': True, 'eval_with_query_type': True, 'max_before_nms': 200, 'max_after_nms': 100, 'distributed_eval': True, 'nms_thd': 0.5, 'q2c_alpha': 20, 'max_vcmr_video': 100, 'full_eval_tasks': ['VCMR', 'SVMR', 'VR'
    ], 'min_pred_l': 2, 'max_pred_l': 16, 'tasks': 'tvr', 'sub_txt_db': '/txt/tv_subtitles.db', 'vfeat_db': '/video/tv', 'train_query_txt_db': '/txt/tvr_train.db', 'val_query_txt_db': '/txt/tvr_val.db', 'drop_sub_prob': 0, 'vcmr_eval_batch_size': 80, 'rank': 0, 'n_gpu': 8
}
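
As a quick way to see exactly which keys trip the assertion, the two dumps above can be diffed with a small standalone helper (illustrative only, not part of the HERO codebase); on these dumps it reports model_config, output_dir, n_gpu, and the keys present on only one side (tasks, drop_sub_prob, test_query_txt_db):

# Illustrative helper, not part of the repo: print every key whose value
# differs between the current options and the restored options.
def diff_opts(current, restored):
    for key in sorted(set(current) | set(restored)):
        cur = current.get(key, "<missing>")
        old = restored.get(key, "<missing>")
        if cur != old:
            print(f"{key}: current={cur!r}, restored={old!r}")

# diff_opts(vars(opts), restore_opts) would report the mismatched entries
# listed above, which is what triggers the AssertionError in /src/utils/save.py.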

To fix this error, I then changed the contents of these two files, $PATH_TO_STORAGE/finetune/tvr_default/log/hps.json and config/train-tvr-8gpu.json:

# store_temp/finetune/tvr_default/log/hps.json
{
    "model_config": "config/hero_finetune.json",
    "checkpoint": "/pretrain/hero-tv-ht100.pt",
    "train_batch_size": 32,
    "val_batch_size": 20,
    "gradient_accumulation_steps": 2,
    "learning_rate": 0.0001,
    "valid_steps": 200,
    "save_steps": 200,
    "optim": "adamw",
    "betas": [
        0.9,
        0.98
    ],
    "dropout": 0.1,
    "weight_decay": 0.01,
    "grad_norm": 1.0,
    "warmup_steps": 500,
    "lr_mul": 1.0,
    "num_train_steps": 5000,
    "output_dir": "/storage/tvr_default",
    "sub_ctx_len": 0,
    "max_clip_len": 100,
    "max_txt_len": 60,
    "vfeat_version": "resnet_slowfast",
    "vfeat_interval": 1.5,
    "compressed_db": false,
    "seed": 77,
    "n_workers": 4,
    "pin_mem": true,
    "fp16": true,
    "task": "tvr",
    "vcmr_eval_video_batch_size": 50,
    "vcmr_eval_q_batch_size": 80,
    "drop_svmr_prob": 0.8,
    "lw_neg_q": 8.0,
    "lw_neg_ctx": 8.0,
    "lw_st_ed": 0.01,
    "ranking_loss_type": "hinge",
    "margin": 0.1,
    "hard_pool_size": [
        20
    ],
    "hard_neg_weights": [
        10
    ],
    "hard_negtiave_start_step": [
        2000
    ],
    "train_span_start_step": 0,
    "use_all_neg": true,
    "eval_with_query_type": true,
    "max_before_nms": 200,
    "max_after_nms": 100,
    "distributed_eval": true,
    "nms_thd": 0.5,
    "q2c_alpha": 20,
    "max_vcmr_video": 100,
    "full_eval_tasks": [
        "VCMR",
        "SVMR",
        "VR"
    ],
    "min_pred_l": 2,
    "max_pred_l": 16,
    "sub_txt_db": "/txt/tv_subtitles.db",
    "vfeat_db": "/video/tv",
    "train_query_txt_db": "/txt/tvr_train.db",
    "val_query_txt_db": "/txt/tvr_val.db",
    "test_query_txt_db": null,
    "vcmr_eval_batch_size": 80,
    "rank": 0,
    "tasks": "tvr",
    "drop_sub_prob": 0,
    "n_gpu": 1
}
# config/train-tvr-8gpu.json
{
    "task": "tvr",
    "sub_txt_db": "/txt/tv_subtitles.db",
    "vfeat_db": "/video/tv",
    "train_query_txt_db": "/txt/tvr_train.db",
    "val_query_txt_db": "/txt/tvr_val.db",
    "test_query_txt_db": null,
    "compressed_db": false,
    "model_config": "config/hero_finetune.json",
    "checkpoint": "/pretrain/hero-tv-ht100.pt",
    "output_dir": "/storage/tvr_default",
    "eval_with_query_type": true,
    "max_before_nms": 200,
    "max_after_nms": 100,
    "distributed_eval": true,
    "nms_thd": 0.5,
    "q2c_alpha": 20,
    "max_vcmr_video": 100,
    "full_eval_tasks": [
        "VCMR",
        "SVMR",
        "VR"
    ],
    "max_clip_len": 100,
    "max_txt_len": 60,
    "vfeat_version": "resnet_slowfast",
    "vfeat_interval": 1.5,
    "min_pred_l": 2,
    "max_pred_l": 16,
    "drop_svmr_prob": 0.8,
    "train_batch_size": 32,
    "val_batch_size": 20,
    "vcmr_eval_video_batch_size": 50,
    "vcmr_eval_batch_size": 80,
    "gradient_accumulation_steps":2,
    "learning_rate": 1e-04,
    "valid_steps": 200,
    "save_steps": 200,
    "num_train_steps": 5000,
    "optim": "adamw",
    "betas": [
        0.9,
        0.98
    ],
    "dropout": 0.1,
    "weight_decay": 0.01,
    "grad_norm": 1.0,
    "warmup_steps": 500,
    "lw_neg_q": 8.0,
    "lw_neg_ctx": 8.0,
    "lw_st_ed": 0.01,
    "ranking_loss_type": "hinge",
    "margin": 0.1,
    "hard_pool_size": [
        20
    ],
    "hard_neg_weights": [
        10
    ],
    "hard_negtiave_start_step": [
        2000
    ],
    "train_span_start_step": 0,
    "sub_ctx_len": 0,
    "use_all_neg": true,
    "seed": 77,
    "fp16": true,
    "n_workers": 4,
    "pin_mem": true,
    "rank": 0,
    "tasks": "tvr",
    "drop_sub_prob": 0
}

Is this the correct way to fix this error?

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
linjieli222 commented, Dec 14, 2021

FYI, the assert vars(opts) == restore_opts check is just there to make sure that you do not overwrite the outputs of an old experiment that you may want to keep.

If you do not want to keep the old results, you can simply delete /storage/tvr_default to bypass the assertion.

We recommend giving a new output_dir to each new experiment you run.
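
Based on the traceback (/src/utils/save.py, line 141) and the hps.json file edited above, the check presumably reloads the hyper-parameters saved under the existing output_dir and compares them to the current run's options. A rough sketch of that behaviour, not the actual source:

# Rough sketch of the resume check (assumed behaviour, not the real code):
import json
import os

def check_output_dir(opts, output_dir):
    hps_path = os.path.join(output_dir, "log", "hps.json")
    if not os.path.exists(hps_path):
        return  # fresh output_dir, nothing to compare against
    with open(hps_path) as f:
        restore_opts = json.load(f)
    # Fails exactly like the traceback above when any saved option differs.
    assert vars(opts) == restore_opts

Deleting /storage/tvr_default (or pointing output_dir at a fresh directory) removes the saved hps.json that the current options are compared against, which is why this is simpler than hand-editing both config files.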

0 reactions
HenryHZY commented, Dec 17, 2021

I will close this issue for now and reopen it if any other questions come up later. Thanks again.
