How to replicate the WMT16 En-De translation result on the Transformer big model?
Hi, I am trying to replicate the WMT16 En-De translation result with the Transformer big model. I preprocess the data with:
```shell
TEXT=./examples/translation/wmt16_en_de
fairseq-preprocess --source-lang en --target-lang de \
    --trainpref $TEXT/train.tok.clean.bpe.32000 \
    --validpref $TEXT/newstest2013.tok.bpe.32000 \
    --testpref $TEXT/newstest2014.tok.bpe.32000 \
    --destdir data-bin/wmt16_en_de_bpe32k \
    --nwordssrc 32768 --nwordstgt 32768 \
    --joined-dictionary --workers 8
```
Then I train the model with:
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 fairseq-train \
    data-bin/wmt16_en_de_bpe32k \
    --arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
    --lr 0.0005 --min-lr 1e-09 \
    --dropout 0.3 --weight-decay 0.0 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 3584 --fp16 \
    --no-progress-bar --save-interval-updates 1000 --keep-interval-updates 20 \
    --save-dir checkpoints/transformer_vaswani_en_de_big --max-epoch 15
```
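One gap worth checking versus the paper: its reported numbers come from averaging the last several checkpoints rather than evaluating a single one. A sketch, assuming a standard fairseq checkout (the script ships in `scripts/`; the output path is illustrative):

```shell
# Average the last 10 update-based checkpoints (saved every
# --save-interval-updates steps) into a single model file.
python scripts/average_checkpoints.py \
    --inputs checkpoints/transformer_vaswani_en_de_big \
    --num-update-checkpoints 10 \
    --output checkpoints/transformer_vaswani_en_de_big/model.avg.pt
```

The averaged `model.avg.pt` can then be passed to generation via `--path`.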
Then I generate translations and compute BLEU with:
```shell
CUDA_VISIBLE_DEVICES="0" python ../generate.py \
    data-bin/wmt16_en_de_bpe32k \
    --path $1 \
    --remove-bpe --beam 4 --batch-size 64 --lenpen 0.6 \
    --max-len-a 1 --max-len-b 50 | tee ./tmp/gen.out
```
```shell
# Compute BLEU score
grep ^H ./tmp/gen.out | cut -f3- > ./tmp/gen.out.sys
grep ^T ./tmp/gen.out | cut -f2- > ./tmp/gen.out.ref
fairseq-score --sys ./tmp/gen.out.sys --ref ./tmp/gen.out.ref
```
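One caveat on scoring: the 28.4 reported in the paper is tokenized BLEU with compound splitting, which `fairseq-score` alone does not apply. Assuming your fairseq checkout includes `scripts/compound_split_bleu.sh` (present in the standard repo layout), the generation log can be rescored with it:

```shell
# Rescore the generation log with compound-split tokenized BLEU,
# matching the scoring convention of the original paper.
bash scripts/compound_split_bleu.sh ./tmp/gen.out
```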
However, after about 15 epochs, I got:

```
| Translated 3003 sentences (86132 tokens) in 29.5s (101.93 sentences/s, 2923.70 tokens/s)
| Generate test with beam=4: BLEU4 = 26.59, 57.8/32.3/20.3/13.2 (BP=1.000, ratio=1.021, syslen=64408, reflen=63078)
Namespace(ignore_case=False, order=4, ref='./tmp/gen.out.ref', sacrebleu=False, sys='./tmp/gen.out.sys')
BLEU4 = 26.59, 57.8/32.3/20.3/13.2 (BP=1.000, ratio=1.021, syslen=64408, reflen=63078)
```
That is far from the 28.4 BLEU reported in "Attention Is All You Need", and training takes almost a full day. Can you think of any reasons for this?
Issue Analytics
- Created 4 years ago
- Comments: 10 (5 by maintainers)
Top GitHub Comments
Can you try `--ddp-backend=no_c10d`? This version may behave a bit better when combined with large `--update-freq` settings.
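For example, on fewer GPUs the two flags can be combined to accumulate gradients and simulate a larger effective batch (a sketch; the `--update-freq` value of 16 is illustrative, not a recommendation from this thread):

```shell
# Accumulate gradients over 16 steps before each update, simulating
# 16x as many GPUs; no_c10d is the legacy DDP backend that tends to
# interact better with gradient accumulation.
fairseq-train data-bin/wmt16_en_de_bpe32k \
    --arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
    --update-freq 16 --ddp-backend=no_c10d
    # (remaining flags as in the training command above)
```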
I ran into the same situation when running the big model, and I am using the newest version of fairseq. What could be the reason?