An error during finetuning for the TVR task
@linjieli222
Hi, I just encountered an error in Quick Start Step 3 when using 1 GPU:
# inside the container
export CUDA_VISIBLE_DEVICES=0
horovodrun -np 1 python train_vcmr.py --config config/train-tvr-8gpu.json
...
...
[1,0]<stderr>:12/13/2021 09:08:05 - INFO - model.model - Decoder Transformer config: None
[1,0]<stderr>:12/13/2021 09:08:08 - INFO - model.modeling_utils - Weights of HeroForVcmr not initialized from pretrained model: ['v_encoder.fom_output.linear_1.weight', 'v_encoder.fom_output.linear_1.bias', 'v_encoder.fom_output.LayerNorm.weight', 'v_encoder.fom_output.LayerNorm.bias', 'v_encoder.fom_output.linear_2.weight', 'v_encoder.fom_output.linear_2.bias']
[1,0]<stderr>:12/13/2021 09:08:08 - INFO - model.modeling_utils - Weights from pretrained model not used in HeroForVcmr: ['vocab_padded', 'v_encoder.fr_output.linear_1.weight', 'v_encoder.fr_output.linear_1.bias', 'v_encoder.fr_output.LayerNorm.weight', 'v_encoder.fr_output.LayerNorm.bias', 'v_encoder.fr_output.linear_2.weight', 'v_encoder.fr_output.linear_2.bias', 'v_encoder.itm_clip_transform.linear_1.weight', 'v_encoder.itm_clip_transform.linear_1.bias', 'v_encoder.itm_clip_transform.LayerNorm.weight', 'v_encoder.itm_clip_transform.LayerNorm.bias', 'v_encoder.itm_clip_transform.linear_2.weight', 'v_encoder.itm_clip_transform.linear_2.bias', 'v_encoder.itm_sub_transform.linear_1.weight', 'v_encoder.itm_sub_transform.linear_1.bias', 'v_encoder.itm_sub_transform.LayerNorm.weight', 'v_encoder.itm_sub_transform.LayerNorm.bias', 'v_encoder.itm_sub_transform.linear_2.weight', 'v_encoder.itm_sub_transform.linear_2.bias']
[1,0]<stdout>:Selected optimization level O2: FP16 training with FP32 batchnorm and FP32 master weights.
[1,0]<stdout>:
[1,0]<stdout>:Defaults for this optimization level are:
[1,0]<stdout>:enabled : True
[1,0]<stdout>:opt_level : O2
[1,0]<stdout>:cast_model_type : torch.float16
[1,0]<stdout>:patch_torch_functions : False
[1,0]<stdout>:keep_batchnorm_fp32 : True
[1,0]<stdout>:master_weights : True
[1,0]<stdout>:loss_scale : dynamic
[1,0]<stdout>:Processing user overrides (additional kwargs that are not None)...
[1,0]<stdout>:After processing overrides, optimization options are:
[1,0]<stdout>:enabled : True
[1,0]<stdout>:opt_level : O2
[1,0]<stdout>:cast_model_type : torch.float16
[1,0]<stdout>:patch_torch_functions : False
[1,0]<stdout>:keep_batchnorm_fp32 : True
[1,0]<stdout>:master_weights : True
[1,0]<stdout>:loss_scale : dynamic
[1,0]<stderr>:Traceback (most recent call last):
[1,0]<stderr>: File "train_vcmr.py", line 399, in <module>
[1,0]<stderr>: main(args)
[1,0]<stderr>: File "train_vcmr.py", line 161, in main
[1,0]<stderr>: restorer = TrainingRestorer(opts, model, optimizer)
[1,0]<stderr>: File "/src/utils/save.py", line 141, in __init__
[1,0]<stderr>: assert vars(opts) == restore_opts
[1,0]<stderr>:AssertionError
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[30056,1],0]
Exit code: 1
--------------------------------------------------------------------------
It seems to be caused by a mismatch between vars(opts) and restore_opts:
vars(opts)= {'model_config': 'config/hero_finetune.json', 'checkpoint': '/pretrain/hero-tv-ht100.pt', 'train_batch_size': 32, 'val_batch_size': 20, 'gradient_accumulation_steps': 2, 'learning_rate': 0.0001, 'valid_steps': 200, 'save_steps': 200, 'optim': 'adamw', 'betas': [
0.9,
0.98
], 'dropout': 0.1, 'weight_decay': 0.01, 'grad_norm': 1.0, 'warmup_steps': 500, 'lr_mul': 1.0, 'num_train_steps': 5000, 'output_dir': '/storage/tvr_default', 'sub_ctx_len': 0, 'max_clip_len': 100, 'max_txt_len': 60, 'vfeat_version': 'resnet_slowfast', 'vfeat_interval': 1.5, 'compressed_db': False, 'seed': 77, 'n_workers': 4, 'pin_mem': True, 'fp16': True, 'task': 'tvr', 'vcmr_eval_video_batch_size': 50, 'vcmr_eval_q_batch_size': 80, 'drop_svmr_prob': 0.8, 'lw_neg_q': 8.0, 'lw_neg_ctx': 8.0, 'lw_st_ed': 0.01, 'ranking_loss_type': 'hinge', 'margin': 0.1, 'hard_pool_size': [
20
], 'hard_neg_weights': [
10
], 'hard_negtiave_start_step': [
2000
], 'train_span_start_step': 0, 'use_all_neg': True, 'eval_with_query_type': True, 'max_before_nms': 200, 'max_after_nms': 100, 'distributed_eval': True, 'nms_thd': 0.5, 'q2c_alpha': 20, 'max_vcmr_video': 100, 'full_eval_tasks': ['VCMR', 'SVMR', 'VR'
], 'min_pred_l': 2, 'max_pred_l': 16, 'sub_txt_db': '/txt/tv_subtitles.db', 'vfeat_db': '/video/tv', 'train_query_txt_db': '/txt/tvr_train.db', 'val_query_txt_db': '/txt/tvr_val.db', 'test_query_txt_db': None, 'vcmr_eval_batch_size': 80, 'rank': 0, 'n_gpu': 1
}
restore_opts= {'model_config': 'config/hero.json', 'checkpoint': '/pretrain/hero-tv-ht100.pt', 'train_batch_size': 32, 'val_batch_size': 20, 'gradient_accumulation_steps': 2, 'learning_rate': 0.0001, 'valid_steps': 200, 'save_steps': 200, 'optim': 'adamw', 'betas': [
0.9,
0.98
], 'dropout': 0.1, 'weight_decay': 0.01, 'grad_norm': 1.0, 'warmup_steps': 500, 'lr_mul': 1.0, 'num_train_steps': 5000, 'output_dir': '/storage/linjie_saved_results/release_debug/tvr_default', 'sub_ctx_len': 0, 'max_clip_len': 100, 'max_txt_len': 60, 'vfeat_version': 'resnet_slowfast', 'vfeat_interval': 1.5, 'compressed_db': False, 'seed': 77, 'n_workers': 4, 'pin_mem': True, 'fp16': True, 'task': 'tvr', 'vcmr_eval_video_batch_size': 50, 'vcmr_eval_q_batch_size': 80, 'drop_svmr_prob': 0.8, 'lw_neg_q': 8.0, 'lw_neg_ctx': 8.0, 'lw_st_ed': 0.01, 'ranking_loss_type': 'hinge', 'margin': 0.1, 'hard_pool_size': [
20
], 'hard_neg_weights': [
10
], 'hard_negtiave_start_step': [
2000
], 'train_span_start_step': 0, 'use_all_neg': True, 'eval_with_query_type': True, 'max_before_nms': 200, 'max_after_nms': 100, 'distributed_eval': True, 'nms_thd': 0.5, 'q2c_alpha': 20, 'max_vcmr_video': 100, 'full_eval_tasks': ['VCMR', 'SVMR', 'VR'
], 'min_pred_l': 2, 'max_pred_l': 16, 'tasks': 'tvr', 'sub_txt_db': '/txt/tv_subtitles.db', 'vfeat_db': '/video/tv', 'train_query_txt_db': '/txt/tvr_train.db', 'val_query_txt_db': '/txt/tvr_val.db', 'drop_sub_prob': 0, 'vcmr_eval_batch_size': 80, 'rank': 0, 'n_gpu': 8
}
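To see exactly which entries make the assertion fail, the two dictionaries can be diffed key by key. The snippet below is only a sketch: diff_opts and MISSING are hypothetical names, not part of the HERO code base, and the sample dictionaries contain just a handful of the entries printed above.

```python
# Hypothetical helper, not part of the HERO repository.
MISSING = "<missing>"  # sentinel for a key that exists on only one side


def diff_opts(current, restored):
    """Return {key: (current_value, restored_value)} for every differing key."""
    keys = sorted(set(current) | set(restored))
    return {
        k: (current.get(k, MISSING), restored.get(k, MISSING))
        for k in keys
        if current.get(k, MISSING) != restored.get(k, MISSING)
    }


# A small subset of the values shown in the dumps above.
current = {
    "model_config": "config/hero_finetune.json",
    "output_dir": "/storage/tvr_default",
    "test_query_txt_db": None,
    "seed": 77,
    "n_gpu": 1,
}
restored = {
    "model_config": "config/hero.json",
    "output_dir": "/storage/linjie_saved_results/release_debug/tvr_default",
    "tasks": "tvr",
    "drop_sub_prob": 0,
    "seed": 77,
    "n_gpu": 8,
}

diff = diff_opts(current, restored)
for key, (cur, old) in diff.items():
    print(f"{key}: current={cur!r} restored={old!r}")
```

With the full dictionaries, the differing keys are model_config, output_dir, n_gpu, plus test_query_txt_db (only in vars(opts)) and tasks and drop_sub_prob (only in restore_opts).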
I then changed the contents of these two files, $PATH_TO_STORAGE/finetune/tvr_default/log/hps.json and config/train-tvr-8gpu.json, to fix this error:
# store_temp/finetune/tvr_default/log/hps.json
{
"model_config": "config/hero_finetune.json",
"checkpoint": "/pretrain/hero-tv-ht100.pt",
"train_batch_size": 32,
"val_batch_size": 20,
"gradient_accumulation_steps": 2,
"learning_rate": 0.0001,
"valid_steps": 200,
"save_steps": 200,
"optim": "adamw",
"betas": [
0.9,
0.98
],
"dropout": 0.1,
"weight_decay": 0.01,
"grad_norm": 1.0,
"warmup_steps": 500,
"lr_mul": 1.0,
"num_train_steps": 5000,
"output_dir": "/storage/tvr_default",
"sub_ctx_len": 0,
"max_clip_len": 100,
"max_txt_len": 60,
"vfeat_version": "resnet_slowfast",
"vfeat_interval": 1.5,
"compressed_db": false,
"seed": 77,
"n_workers": 4,
"pin_mem": true,
"fp16": true,
"task": "tvr",
"vcmr_eval_video_batch_size": 50,
"vcmr_eval_q_batch_size": 80,
"drop_svmr_prob": 0.8,
"lw_neg_q": 8.0,
"lw_neg_ctx": 8.0,
"lw_st_ed": 0.01,
"ranking_loss_type": "hinge",
"margin": 0.1,
"hard_pool_size": [
20
],
"hard_neg_weights": [
10
],
"hard_negtiave_start_step": [
2000
],
"train_span_start_step": 0,
"use_all_neg": true,
"eval_with_query_type": true,
"max_before_nms": 200,
"max_after_nms": 100,
"distributed_eval": true,
"nms_thd": 0.5,
"q2c_alpha": 20,
"max_vcmr_video": 100,
"full_eval_tasks": [
"VCMR",
"SVMR",
"VR"
],
"min_pred_l": 2,
"max_pred_l": 16,
"sub_txt_db": "/txt/tv_subtitles.db",
"vfeat_db": "/video/tv",
"train_query_txt_db": "/txt/tvr_train.db",
"val_query_txt_db": "/txt/tvr_val.db",
"test_query_txt_db": null,
"vcmr_eval_batch_size": 80,
"rank": 0,
"tasks": "tvr",
"drop_sub_prob": 0,
"n_gpu": 1
}
# config/train-tvr-8gpu.json
{
"task": "tvr",
"sub_txt_db": "/txt/tv_subtitles.db",
"vfeat_db": "/video/tv",
"train_query_txt_db": "/txt/tvr_train.db",
"val_query_txt_db": "/txt/tvr_val.db",
"test_query_txt_db": null,
"compressed_db": false,
"model_config": "config/hero_finetune.json",
"checkpoint": "/pretrain/hero-tv-ht100.pt",
"output_dir": "/storage/tvr_default",
"eval_with_query_type": true,
"max_before_nms": 200,
"max_after_nms": 100,
"distributed_eval": true,
"nms_thd": 0.5,
"q2c_alpha": 20,
"max_vcmr_video": 100,
"full_eval_tasks": [
"VCMR",
"SVMR",
"VR"
],
"max_clip_len": 100,
"max_txt_len": 60,
"vfeat_version": "resnet_slowfast",
"vfeat_interval": 1.5,
"min_pred_l": 2,
"max_pred_l": 16,
"drop_svmr_prob": 0.8,
"train_batch_size": 32,
"val_batch_size": 20,
"vcmr_eval_video_batch_size": 50,
"vcmr_eval_batch_size": 80,
"gradient_accumulation_steps":2,
"learning_rate": 1e-04,
"valid_steps": 200,
"save_steps": 200,
"num_train_steps": 5000,
"optim": "adamw",
"betas": [
0.9,
0.98
],
"dropout": 0.1,
"weight_decay": 0.01,
"grad_norm": 1.0,
"warmup_steps": 500,
"lw_neg_q": 8.0,
"lw_neg_ctx": 8.0,
"lw_st_ed": 0.01,
"ranking_loss_type": "hinge",
"margin": 0.1,
"hard_pool_size": [
20
],
"hard_neg_weights": [
10
],
"hard_negtiave_start_step": [
2000
],
"train_span_start_step": 0,
"sub_ctx_len": 0,
"use_all_neg": true,
"seed": 77,
"fp16": true,
"n_workers": 4,
"pin_mem": true,
"rank": 0,
"tasks": "tvr",
"drop_sub_prob": 0
}
Is this the correct way to fix this error?
Issue Analytics
- State:
- Created 2 years ago
- Comments:5 (3 by maintainers)
FYI, the check assert vars(opts) == restore_opts is just there to make sure that you do not overwrite old experiment outputs that you may want to keep. If you do not want to keep the old results, you can simply delete /storage/tvr_default to bypass the assertion. We recommend using a new output_dir for each new experiment you run. I would like to close this issue for now and reopen it if there are any other questions later. Thanks again.
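The guard described above can be illustrated roughly as follows. This is a simplified sketch and not the actual code in utils/save.py: check_or_save_opts is a hypothetical name, and the sketch assumes the options are stored as log/hps.json under output_dir, as the file path in the question suggests.

```python
import json
import os
import tempfile


def check_or_save_opts(opts, output_dir):
    """Save opts on the first run; refuse to reuse output_dir if they changed.

    Hypothetical helper for illustration only.
    """
    hps_path = os.path.join(output_dir, "log", "hps.json")
    # Round-trip through JSON so tuples and lists compare consistently.
    normalized = json.loads(json.dumps(opts))
    if os.path.exists(hps_path):
        with open(hps_path) as f:
            saved = json.load(f)
        if saved != normalized:
            raise ValueError(
                f"{output_dir} holds results from a different configuration; "
                "delete it or choose a new output_dir"
            )
        return "resumed"
    os.makedirs(os.path.dirname(hps_path), exist_ok=True)
    with open(hps_path, "w") as f:
        json.dump(normalized, f, indent=4)
    return "fresh"


with tempfile.TemporaryDirectory() as root:
    out = os.path.join(root, "tvr_default")
    opts = {"model_config": "config/hero_finetune.json", "n_gpu": 8}

    first = check_or_save_opts(opts, out)   # nothing saved yet
    second = check_or_save_opts(opts, out)  # same options, safe to resume

    changed = dict(opts, n_gpu=1)
    try:
        check_or_save_opts(changed, out)    # same directory, new options
        mismatch_detected = False
    except ValueError:
        mismatch_detected = True

    # A new output_dir avoids the conflict without touching old results.
    third = check_or_save_opts(changed, os.path.join(root, "tvr_1gpu"))
```

In this sketch, rerunning with changed options only succeeds once the run points at a different directory, which matches the recommendation to use a new output_dir per experiment.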