How to reproduce the result of WMT14 en-de on transformer BASE model?
Hi,
I want to replicate the WMT14 En-De translation result of the Transformer base model from the paper “Attention Is All You Need”. Following the instructions here, I downloaded and preprocessed the data (a rough sketch of that preprocessing step is included after the training command below). Then I trained the model with this:
CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py data-bin/wmt16_en_de_bpe32k \
--arch transformer_wmt_en_de --share-all-embeddings \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
--lr 0.0005 --min-lr 1e-09 \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 --weight-decay 0.0 \
--max-tokens 4096 --save-dir checkpoints/en-de \
--update-freq 2 --no-progress-bar --log-format json --log-interval 50 \
--save-interval-updates 1000 --keep-interval-updates 20
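For context, the download/preprocessing step mentioned at the top follows the fairseq translation example; a rough sketch is below. The file names inside the downloaded archive and the exact flag set are assumptions and may differ between fairseq versions:
TEXT=wmt16_en_de_bpe32k
# --joined-dictionary is needed because training uses --share-all-embeddings
python preprocess.py --source-lang en --target-lang de \
--trainpref $TEXT/train.tok.clean.bpe.32000 \
--validpref $TEXT/newstest2013.tok.bpe.32000 \
--testpref $TEXT/newstest2014.tok.bpe.32000 \
--destdir data-bin/wmt16_en_de_bpe32k \
--nwordssrc 32768 --nwordstgt 32768 --joined-dictionary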
I averaged the last 5 checkpoints and generated the translations with this:
model=model.pt
subset="test"
CUDA_VISIBLE_DEVICES=0 python generate.py data-bin/wmt16_en_de_bpe32k \
--path checkpoints/$model --gen-subset $subset \
--beam 4 --batch-size 128 --remove-bpe --lenpen 0.6
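The checkpoint-averaging step itself is not shown above. In fairseq it is typically done with scripts/average_checkpoints.py; a minimal sketch, assuming that script's interface at the time, with the output written to the checkpoints/model.pt path used in the generation command:
# average the last 5 --save-interval-updates checkpoints in the save dir
python scripts/average_checkpoints.py \
--inputs checkpoints/en-de \
--num-update-checkpoints 5 \
--output checkpoints/model.pt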
However, after about 120k updates, I got:
| Generate test with beam=4: BLEU4 = 26.38, 57.8/32.0/20.0/13.1 (BP=1.000, ratio=1.020, syslen=64352, reflen=63078)
After about 250k updates, I got:
| Generate test with beam=4: BLEU4 = 26.39, 57.8/32.0/20.0/13.1 (BP=1.000, ratio=1.017, syslen=64123, reflen=63078)
That is still far from the 27.3 reported in “Attention Is All You Need”. Can you think of any reasons for this? Thanks a lot!
Top GitHub Comments
Great! The last step to reproduce results from Vaswani et al. is to split compound words. This step gives a moderate increase in BLEU but is somewhat hacky. In general it’s preferable to report detokenized BLEU via tools like sacrebleu, although detok. BLEU is usually lower than tokenized BLEU. See this paper: https://arxiv.org/abs/1804.08771
Here is the script: https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/get_ende_bleu.sh The compound splitting is near the bottom of the script.
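For reference, the compound splitting in that script boils down to a single substitution applied to both the system output and the tokenized reference before scoring with multi-bleu.perl, while detokenized BLEU can instead be computed with sacrebleu against the official test set. A rough sketch, where gen.sys, gen.ref and gen.sys.detok are placeholder file names and multi-bleu.perl comes from the Moses scripts:
# put hyphenated compounds into the ##AT##-##AT## form before tokenized BLEU
perl -ple 's{(\S)-(\S)}{$1 ##AT##-##AT## $2}g' < gen.ref > gen.ref.atat
perl -ple 's{(\S)-(\S)}{$1 ##AT##-##AT## $2}g' < gen.sys > gen.sys.atat
perl multi-bleu.perl gen.ref.atat < gen.sys.atat
# or report detokenized BLEU on the full newstest2014 set with sacrebleu
cat gen.sys.detok | sacrebleu -t wmt14/full -l en-de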
That’s so interesting! After using this script, I got: BLEU = 27.70, 58.9/33.4/21.2/14.1 (BP=1.000, ratio=1.015, hyp_len=65442, ref_len=64496). Meanwhile, I find that the averaged model at about 180k updates already reaches BLEU = 27.37, 58.6/33.0/21.0/13.8 (BP=1.000, ratio=1.016, hyp_len=65500, ref_len=64496). Thanks again for your help! 👍