How to replicate the WMT16 En-De translation result on the Transformer big model?
Hi, I am trying to replicate the WMT16 En-De translation result with the Transformer big model. I preprocess the data with:
```shell
TEXT=./examples/translation/wmt16_en_de
fairseq-preprocess --source-lang en --target-lang de \
    --trainpref $TEXT/train.tok.clean.bpe.32000 \
    --validpref $TEXT/newstest2013.tok.bpe.32000 \
    --testpref $TEXT/newstest2014.tok.bpe.32000 \
    --destdir data-bin/wmt16_en_de_bpe32k \
    --nwordssrc 32768 --nwordstgt 32768 \
    --joined-dictionary --workers 8
```
Then I train the model with:
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 fairseq-train \
    data-bin/wmt16_en_de_bpe32k \
    --arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
    --lr 0.0005 --min-lr 1e-09 \
    --dropout 0.3 --weight-decay 0.0 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 3584 --fp16 \
    --no-progress-bar --save-interval-updates 1000 --keep-interval-updates 20 \
    --save-dir checkpoints/transformer_vaswani_en_de_big --max-epoch 15
```
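One gap worth checking versus the paper: its reported numbers come from averaging the last several checkpoints rather than evaluating a single one. A sketch, assuming a standard fairseq checkout (the script ships in `scripts/`; the output path is illustrative):

```shell
# Average the last 10 update-based checkpoints (saved every
# --save-interval-updates steps) into a single model file.
python scripts/average_checkpoints.py \
    --inputs checkpoints/transformer_vaswani_en_de_big \
    --num-update-checkpoints 10 \
    --output checkpoints/transformer_vaswani_en_de_big/model.avg.pt
```

The averaged `model.avg.pt` can then be passed to generation via `--path`.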
Then I generate translations and compute BLEU with:
```shell
CUDA_VISIBLE_DEVICES="0" python ../generate.py \
    data-bin/wmt16_en_de_bpe32k \
    --path $1 \
    --remove-bpe --beam 4 --batch-size 64 --lenpen 0.6 \
    --max-len-a 1 --max-len-b 50 | tee ./tmp/gen.out
```
```shell
# Compute BLEU score
grep ^H ./tmp/gen.out | cut -f3- > ./tmp/gen.out.sys
grep ^T ./tmp/gen.out | cut -f2- > ./tmp/gen.out.ref
fairseq-score --sys ./tmp/gen.out.sys --ref ./tmp/gen.out.ref
```
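One caveat on scoring: the 28.4 reported in the paper is tokenized BLEU with compound splitting, which `fairseq-score` alone does not apply. Assuming your fairseq checkout includes `scripts/compound_split_bleu.sh` (present in the standard repo layout), the generation log can be rescored with it:

```shell
# Rescore the generation log with compound-split tokenized BLEU,
# matching the scoring convention of the original paper.
bash scripts/compound_split_bleu.sh ./tmp/gen.out
```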
However, after about 15 epochs, I got:

```
| Translated 3003 sentences (86132 tokens) in 29.5s (101.93 sentences/s, 2923.70 tokens/s)
| Generate test with beam=4: BLEU4 = 26.59, 57.8/32.3/20.3/13.2 (BP=1.000, ratio=1.021, syslen=64408, reflen=63078)
Namespace(ignore_case=False, order=4, ref='./tmp/gen.out.ref', sacrebleu=False, sys='./tmp/gen.out.sys')
BLEU4 = 26.59, 57.8/32.3/20.3/13.2 (BP=1.000, ratio=1.021, syslen=64408, reflen=63078)
```
That is far from the 28.4 BLEU reported in "Attention Is All You Need", and training takes almost a full day. Can you think of any reasons for this?
Issue Analytics
- Created 4 years ago
- Comments: 10 (5 by maintainers)
Top GitHub Comments
Can you try `--ddp-backend=no_c10d`? This version may behave a bit better when combined with large `--update-freq` settings.
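For example, on fewer GPUs the two flags can be combined to accumulate gradients and simulate a larger effective batch (a sketch; the `--update-freq` value of 16 is illustrative, not a recommendation from this thread):

```shell
# Accumulate gradients over 16 steps before each update, simulating
# 16x as many GPUs; no_c10d is the legacy DDP backend that tends to
# interact better with gradient accumulation.
fairseq-train data-bin/wmt16_en_de_bpe32k \
    --arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
    --update-freq 16 --ddp-backend=no_c10d
    # (remaining flags as in the training command above)
```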
I ran into the same situation when running the big model, and I am using the newest version of fairseq. What could be the reason?