
An error during finetuning for the TVR task

See original GitHub issue

@linjieli222 Hi, I just encountered an error in Quick Start Step 3 when using 1 GPU:

# inside the container
CUDA_VISIBLE_DEVICES=0
horovodrun -np 1 python train_vcmr.py --config config/train-tvr-8gpu.json
...
...
[1,0]<stderr>:12/13/2021 09:08:05 - INFO - model.model -        Decoder Transformer config: None
[1,0]<stderr>:12/13/2021 09:08:08 - INFO - model.modeling_utils -   Weights of HeroForVcmr not initialized from pretrained model: ['v_encoder.fom_output.linear_1.weight', 'v_encoder.fom_output.linear_1.bias', 'v_encoder.fom_output.LayerNorm.weight', 'v_encoder.fom_output.LayerNorm.bias', 'v_encoder.fom_output.linear_2.weight', 'v_encoder.fom_output.linear_2.bias']
[1,0]<stderr>:12/13/2021 09:08:08 - INFO - model.modeling_utils -   Weights from pretrained model not used in HeroForVcmr: ['vocab_padded', 'v_encoder.fr_output.linear_1.weight', 'v_encoder.fr_output.linear_1.bias', 'v_encoder.fr_output.LayerNorm.weight', 'v_encoder.fr_output.LayerNorm.bias', 'v_encoder.fr_output.linear_2.weight', 'v_encoder.fr_output.linear_2.bias', 'v_encoder.itm_clip_transform.linear_1.weight', 'v_encoder.itm_clip_transform.linear_1.bias', 'v_encoder.itm_clip_transform.LayerNorm.weight', 'v_encoder.itm_clip_transform.LayerNorm.bias', 'v_encoder.itm_clip_transform.linear_2.weight', 'v_encoder.itm_clip_transform.linear_2.bias', 'v_encoder.itm_sub_transform.linear_1.weight', 'v_encoder.itm_sub_transform.linear_1.bias', 'v_encoder.itm_sub_transform.LayerNorm.weight', 'v_encoder.itm_sub_transform.LayerNorm.bias', 'v_encoder.itm_sub_transform.linear_2.weight', 'v_encoder.itm_sub_transform.linear_2.bias']
[1,0]<stdout>:Selected optimization level O2:  FP16 training with FP32 batchnorm and FP32 master weights.
[1,0]<stdout>:
[1,0]<stdout>:Defaults for this optimization level are:
[1,0]<stdout>:enabled                : True
[1,0]<stdout>:opt_level              : O2
[1,0]<stdout>:cast_model_type        : torch.float16
[1,0]<stdout>:patch_torch_functions  : False
[1,0]<stdout>:keep_batchnorm_fp32    : True
[1,0]<stdout>:master_weights         : True
[1,0]<stdout>:loss_scale             : dynamic
[1,0]<stdout>:Processing user overrides (additional kwargs that are not None)...
[1,0]<stdout>:After processing overrides, optimization options are:
[1,0]<stdout>:enabled                : True
[1,0]<stdout>:opt_level              : O2
[1,0]<stdout>:cast_model_type        : torch.float16
[1,0]<stdout>:patch_torch_functions  : False
[1,0]<stdout>:keep_batchnorm_fp32    : True
[1,0]<stdout>:master_weights         : True
[1,0]<stdout>:loss_scale             : dynamic
[1,0]<stderr>:Traceback (most recent call last):
[1,0]<stderr>:  File "train_vcmr.py", line 399, in <module>
[1,0]<stderr>:    main(args)
[1,0]<stderr>:  File "train_vcmr.py", line 161, in main
[1,0]<stderr>:    restorer = TrainingRestorer(opts, model, optimizer)
[1,0]<stderr>:  File "/src/utils/save.py", line 141, in __init__
[1,0]<stderr>:    assert vars(opts) == restore_opts
[1,0]<stderr>:AssertionError
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[30056,1],0]
  Exit code:    1
--------------------------------------------------------------------------

It seems to be caused by a mismatch between vars(opts) and restore_opts:

vars(opts)= {'model_config': 'config/hero_finetune.json', 'checkpoint': '/pretrain/hero-tv-ht100.pt', 'train_batch_size': 32, 'val_batch_size': 20, 'gradient_accumulation_steps': 2, 'learning_rate': 0.0001, 'valid_steps': 200, 'save_steps': 200, 'optim': 'adamw', 'betas': [
        0.9,
        0.98
    ], 'dropout': 0.1, 'weight_decay': 0.01, 'grad_norm': 1.0, 'warmup_steps': 500, 'lr_mul': 1.0, 'num_train_steps': 5000, 'output_dir': '/storage/tvr_default', 'sub_ctx_len': 0, 'max_clip_len': 100, 'max_txt_len': 60, 'vfeat_version': 'resnet_slowfast', 'vfeat_interval': 1.5, 'compressed_db': False, 'seed': 77, 'n_workers': 4, 'pin_mem': True, 'fp16': True, 'task': 'tvr', 'vcmr_eval_video_batch_size': 50, 'vcmr_eval_q_batch_size': 80, 'drop_svmr_prob': 0.8, 'lw_neg_q': 8.0, 'lw_neg_ctx': 8.0, 'lw_st_ed': 0.01, 'ranking_loss_type': 'hinge', 'margin': 0.1, 'hard_pool_size': [
        20
    ], 'hard_neg_weights': [
        10
    ], 'hard_negtiave_start_step': [
        2000
    ], 'train_span_start_step': 0, 'use_all_neg': True, 'eval_with_query_type': True, 'max_before_nms': 200, 'max_after_nms': 100, 'distributed_eval': True, 'nms_thd': 0.5, 'q2c_alpha': 20, 'max_vcmr_video': 100, 'full_eval_tasks': ['VCMR', 'SVMR', 'VR'
    ], 'min_pred_l': 2, 'max_pred_l': 16, 'sub_txt_db': '/txt/tv_subtitles.db', 'vfeat_db': '/video/tv', 'train_query_txt_db': '/txt/tvr_train.db', 'val_query_txt_db': '/txt/tvr_val.db', 'test_query_txt_db': None, 'vcmr_eval_batch_size': 80, 'rank': 0, 'n_gpu': 1
}

restore_opts= {'model_config': 'config/hero.json', 'checkpoint': '/pretrain/hero-tv-ht100.pt', 'train_batch_size': 32, 'val_batch_size': 20, 'gradient_accumulation_steps': 2, 'learning_rate': 0.0001, 'valid_steps': 200, 'save_steps': 200, 'optim': 'adamw', 'betas': [
        0.9,
        0.98
    ], 'dropout': 0.1, 'weight_decay': 0.01, 'grad_norm': 1.0, 'warmup_steps': 500, 'lr_mul': 1.0, 'num_train_steps': 5000, 'output_dir': '/storage/linjie_saved_results/release_debug/tvr_default', 'sub_ctx_len': 0, 'max_clip_len': 100, 'max_txt_len': 60, 'vfeat_version': 'resnet_slowfast', 'vfeat_interval': 1.5, 'compressed_db': False, 'seed': 77, 'n_workers': 4, 'pin_mem': True, 'fp16': True, 'task': 'tvr', 'vcmr_eval_video_batch_size': 50, 'vcmr_eval_q_batch_size': 80, 'drop_svmr_prob': 0.8, 'lw_neg_q': 8.0, 'lw_neg_ctx': 8.0, 'lw_st_ed': 0.01, 'ranking_loss_type': 'hinge', 'margin': 0.1, 'hard_pool_size': [
        20
    ], 'hard_neg_weights': [
        10
    ], 'hard_negtiave_start_step': [
        2000
    ], 'train_span_start_step': 0, 'use_all_neg': True, 'eval_with_query_type': True, 'max_before_nms': 200, 'max_after_nms': 100, 'distributed_eval': True, 'nms_thd': 0.5, 'q2c_alpha': 20, 'max_vcmr_video': 100, 'full_eval_tasks': ['VCMR', 'SVMR', 'VR'
    ], 'min_pred_l': 2, 'max_pred_l': 16, 'tasks': 'tvr', 'sub_txt_db': '/txt/tv_subtitles.db', 'vfeat_db': '/video/tv', 'train_query_txt_db': '/txt/tvr_train.db', 'val_query_txt_db': '/txt/tvr_val.db', 'drop_sub_prob': 0, 'vcmr_eval_batch_size': 80, 'rank': 0, 'n_gpu': 8
}
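
As a quick way to see exactly which keys trip the assertion, the two dumps above can be diffed with a small standalone helper (illustrative only, not part of the HERO codebase); on these dumps it reports model_config, output_dir, n_gpu, and the keys present on only one side (tasks, drop_sub_prob, test_query_txt_db):

# Illustrative helper, not part of the repo: print every key whose value
# differs between the current options and the restored options.
def diff_opts(current, restored):
    for key in sorted(set(current) | set(restored)):
        cur = current.get(key, "<missing>")
        old = restored.get(key, "<missing>")
        if cur != old:
            print(f"{key}: current={cur!r}, restored={old!r}")

# diff_opts(vars(opts), restore_opts) would report the mismatched entries
# listed above, which is what triggers the AssertionError in /src/utils/save.py.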

To fix this error, I then changed the contents of these two files, $PATH_TO_STORAGE/finetune/tvr_default/log/hps.json and config/train-tvr-8gpu.json:

# store_temp/finetune/tvr_default/log/hps.json
{
    "model_config": "config/hero_finetune.json",
    "checkpoint": "/pretrain/hero-tv-ht100.pt",
    "train_batch_size": 32,
    "val_batch_size": 20,
    "gradient_accumulation_steps": 2,
    "learning_rate": 0.0001,
    "valid_steps": 200,
    "save_steps": 200,
    "optim": "adamw",
    "betas": [
        0.9,
        0.98
    ],
    "dropout": 0.1,
    "weight_decay": 0.01,
    "grad_norm": 1.0,
    "warmup_steps": 500,
    "lr_mul": 1.0,
    "num_train_steps": 5000,
    "output_dir": "/storage/tvr_default",
    "sub_ctx_len": 0,
    "max_clip_len": 100,
    "max_txt_len": 60,
    "vfeat_version": "resnet_slowfast",
    "vfeat_interval": 1.5,
    "compressed_db": false,
    "seed": 77,
    "n_workers": 4,
    "pin_mem": true,
    "fp16": true,
    "task": "tvr",
    "vcmr_eval_video_batch_size": 50,
    "vcmr_eval_q_batch_size": 80,
    "drop_svmr_prob": 0.8,
    "lw_neg_q": 8.0,
    "lw_neg_ctx": 8.0,
    "lw_st_ed": 0.01,
    "ranking_loss_type": "hinge",
    "margin": 0.1,
    "hard_pool_size": [
        20
    ],
    "hard_neg_weights": [
        10
    ],
    "hard_negtiave_start_step": [
        2000
    ],
    "train_span_start_step": 0,
    "use_all_neg": true,
    "eval_with_query_type": true,
    "max_before_nms": 200,
    "max_after_nms": 100,
    "distributed_eval": true,
    "nms_thd": 0.5,
    "q2c_alpha": 20,
    "max_vcmr_video": 100,
    "full_eval_tasks": [
        "VCMR",
        "SVMR",
        "VR"
    ],
    "min_pred_l": 2,
    "max_pred_l": 16,
    "sub_txt_db": "/txt/tv_subtitles.db",
    "vfeat_db": "/video/tv",
    "train_query_txt_db": "/txt/tvr_train.db",
    "val_query_txt_db": "/txt/tvr_val.db",
    "test_query_txt_db": null,
    "vcmr_eval_batch_size": 80,
    "rank": 0,
    "tasks": "tvr",
    "drop_sub_prob": 0,
    "n_gpu": 1
}
# config/train-tvr-8gpu.json
{
    "task": "tvr",
    "sub_txt_db": "/txt/tv_subtitles.db",
    "vfeat_db": "/video/tv",
    "train_query_txt_db": "/txt/tvr_train.db",
    "val_query_txt_db": "/txt/tvr_val.db",
    "test_query_txt_db": null,
    "compressed_db": false,
    "model_config": "config/hero_finetune.json",
    "checkpoint": "/pretrain/hero-tv-ht100.pt",
    "output_dir": "/storage/tvr_default",
    "eval_with_query_type": true,
    "max_before_nms": 200,
    "max_after_nms": 100,
    "distributed_eval": true,
    "nms_thd": 0.5,
    "q2c_alpha": 20,
    "max_vcmr_video": 100,
    "full_eval_tasks": [
        "VCMR",
        "SVMR",
        "VR"
    ],
    "max_clip_len": 100,
    "max_txt_len": 60,
    "vfeat_version": "resnet_slowfast",
    "vfeat_interval": 1.5,
    "min_pred_l": 2,
    "max_pred_l": 16,
    "drop_svmr_prob": 0.8,
    "train_batch_size": 32,
    "val_batch_size": 20,
    "vcmr_eval_video_batch_size": 50,
    "vcmr_eval_batch_size": 80,
    "gradient_accumulation_steps":2,
    "learning_rate": 1e-04,
    "valid_steps": 200,
    "save_steps": 200,
    "num_train_steps": 5000,
    "optim": "adamw",
    "betas": [
        0.9,
        0.98
    ],
    "dropout": 0.1,
    "weight_decay": 0.01,
    "grad_norm": 1.0,
    "warmup_steps": 500,
    "lw_neg_q": 8.0,
    "lw_neg_ctx": 8.0,
    "lw_st_ed": 0.01,
    "ranking_loss_type": "hinge",
    "margin": 0.1,
    "hard_pool_size": [
        20
    ],
    "hard_neg_weights": [
        10
    ],
    "hard_negtiave_start_step": [
        2000
    ],
    "train_span_start_step": 0,
    "sub_ctx_len": 0,
    "use_all_neg": true,
    "seed": 77,
    "fp16": true,
    "n_workers": 4,
    "pin_mem": true,
    "rank": 0,
    "tasks": "tvr",
    "drop_sub_prob": 0
}

Is this the correct way to fix this error?

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
linjieli222 commented, Dec 14, 2021

FYI, the assert vars(opts) == restore_opts check is just there to make sure that you do not overwrite the outputs of an old experiment that you may want to keep.

If you do not want to keep the old results, you can simply delete /storage/tvr_default to bypass the assertion.

We recommend giving a new output_dir to each new experiment you run.
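
Based on the traceback (/src/utils/save.py, line 141) and the hps.json file edited above, the check presumably reloads the hyper-parameters saved under the existing output_dir and compares them to the current run's options. A rough sketch of that behaviour, not the actual source:

# Rough sketch of the resume check (assumed behaviour, not the real code):
import json
import os

def check_output_dir(opts, output_dir):
    hps_path = os.path.join(output_dir, "log", "hps.json")
    if not os.path.exists(hps_path):
        return  # fresh output_dir, nothing to compare against
    with open(hps_path) as f:
        restore_opts = json.load(f)
    # Fails exactly like the traceback above when any saved option differs.
    assert vars(opts) == restore_opts

Deleting /storage/tvr_default (or pointing output_dir at a fresh directory) removes the saved hps.json that the current options are compared against, which is why this is simpler than hand-editing both config files.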

0 reactions
HenryHZY commented, Dec 17, 2021

I will close this issue for now and reopen it if any other questions come up later. Thanks again.
