Unable to reproduce the 100k results

Dear authors, thanks for open-sourcing the code. I pretrained for 100k steps and fine-tuned on VQAv2, but my test-dev score is about 65, well below the 70.8 reported in the paper.

Here are my pretraining and fine-tuning commands:

python run.py with data_root=vilt_dataset/ \
	num_gpus=8 num_nodes=8 task_mlm_itm whole_word_masking=True step100k \
	per_gpu_batchsize=64 exp_name=pretrain

python run.py with data_root=vilt_dataset/ \
	num_gpus=8 num_nodes=1 task_finetune_vqa_randaug \
	per_gpu_batchsize=32 load_path="result/pretrain_seed0_from_/version_0/checkpoints/last.ckpt" \
	exp_name=vqa_finetune

I then generate the JSON with:

python run.py with data_root=vilt_dataset/ \
	num_gpus=4 num_nodes=1 task_finetune_vqa \
	per_gpu_batchsize=256 load_path="result/vqa_finetune_seed0_from_last/version_0/checkpoints/last.ckpt" \
	test_only=True  exp_name="test_vqa"

Here are my pretraining and fine-tuning TensorBoard logs: [Screenshot, 2021-06-10 6:34:22 PM] [Screenshot, 2021-06-10 6:35:14 PM]

Issue Analytics

Created: 2 years ago
Comments: 8 (3 by maintainers)

Top GitHub Comments

dandelin commented, Jun 11, 2021 (2 reactions)

@JACKHAHA363 Thank you for your report. After carefully comparing the published (cleaned) version of the source code with our internal version, we found that the internal version trains the pretraining losses jointly, whereas the cleaned version alternates between them.

I patched the code to do joint training (https://github.com/dandelin/ViLT/commit/98a51e6058b1bcdd98ee6628ceacdd1c7325525f); please try this version. Sorry for our mistake. Alternating training needs more iterations to converge.
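To illustrate why alternating training may need more iterations, here is a minimal sketch that counts how many optimizer steps each pretraining loss contributes to under the two schedules. The round-robin alternation and the function names `losses_for_step` and `updates_per_task` are assumptions made for illustration; they are not taken from the ViLT code, and the cleaned version may have alternated differently.

```python
def losses_for_step(step, tasks, joint):
    """Return the losses that contribute to the update at a given step.

    joint=True : every loss is summed into every update.
    joint=False: one loss per step, chosen round-robin (an assumed schedule).
    """
    if joint:
        return list(tasks)
    return [tasks[step % len(tasks)]]


def updates_per_task(num_steps, tasks, joint):
    """Count how many updates each loss takes part in over num_steps."""
    counts = {task: 0 for task in tasks}
    for step in range(num_steps):
        for task in losses_for_step(step, tasks, joint):
            counts[task] += 1
    return counts


tasks = ["mlm", "itm"]
joint_counts = updates_per_task(100_000, tasks, joint=True)
alternating_counts = updates_per_task(100_000, tasks, joint=False)
print("joint:      ", joint_counts)
print("alternating:", alternating_counts)
```

Under this assumed schedule, each loss is optimized in only half of the 100k steps when alternating, which is consistent with the maintainer's remark that the alternating variant converges more slowly.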

dandelin commented, Jun 11, 2021 (1 reaction)

@JACKHAHA363 Those two (MLM and ITM) need different inputs. For ITM, we use unmasked inputs (and also a misaligned image-text pair). So an iteration requires running the transformer three times: aligned masked text + image for MLM, and aligned unmasked text + image plus misaligned unmasked text + image for ITM.
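The three forward passes described above can be sketched as follows. This is a toy, framework-free illustration and not the ViLT implementation: the encoder is a stub that only records its calls, and every name here (`make_counting_encoder`, `mask_tokens`, `joint_step_inputs`, `run_joint_step`) as well as the 0.15 masking probability is a hypothetical choice for the example.

```python
import random


def make_counting_encoder():
    """Return a stub 'transformer' plus the list that records its calls."""
    calls = []

    def encoder(text_tokens, image):
        calls.append((tuple(text_tokens), image))
        return {"text": list(text_tokens), "image": image}

    return encoder, calls


def mask_tokens(tokens, rng, prob=0.15, mask_token="[MASK]"):
    """Replace each token with mask_token with probability prob (assumed value)."""
    return [mask_token if rng.random() < prob else tok for tok in tokens]


def joint_step_inputs(tokens, image, other_image, rng):
    """Build the three (name, text, image) inputs needed for one joint iteration."""
    return [
        ("mlm", mask_tokens(tokens, rng), image),    # aligned, masked text
        ("itm_pos", list(tokens), image),            # aligned, unmasked text
        ("itm_neg", list(tokens), other_image),      # misaligned, unmasked text
    ]


def run_joint_step(encoder, tokens, image, other_image, rng):
    """Run the encoder once per input; a real step would sum the losses."""
    outputs = {}
    for name, text, img in joint_step_inputs(tokens, image, other_image, rng):
        outputs[name] = encoder(text, img)
    return outputs


encoder, calls = make_counting_encoder()
outputs = run_joint_step(encoder, ["a", "dog", "runs"], "img_0", "img_1", random.Random(0))
print("forward passes in one iteration:", len(calls))
```

The point of the sketch is only the call count: because MLM needs masked text while ITM needs unmasked text for both an aligned and a misaligned pair, one joint iteration costs three encoder passes rather than one.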