Nonautoregressive Transformer performs poorly
Thank you for publishing the code of NAT. I tried to reproduce the results of the vanilla NAT following the instructions. I trained NAT on the distilled WMT14 En-De dataset provided by Fairseq with the following script:
fairseq-train \
data-bin/wmt14_en_de_distill \
--save-dir checkpoints \
--ddp-backend=no_c10d \
--task translation_lev \
--criterion nat_loss \
--arch nonautoregressive_transformer \
--noise full_mask \
--share-all-embeddings \
--optimizer adam --adam-betas '(0.9,0.98)' \
--lr 0.0005 --lr-scheduler inverse_sqrt \
--min-lr '1e-09' --warmup-updates 10000 \
--warmup-init-lr '1e-07' --label-smoothing 0.1 \
--dropout 0.3 --weight-decay 0.01 \
--decoder-learned-pos \
--encoder-learned-pos \
--pred-length-offset \
--length-loss-factor 0.1 \
--apply-bert-init \
--log-format 'simple' --log-interval 100 \
--fixed-validation-seed 7 \
--max-tokens 8000 \
--save-interval-updates 10000 \
--max-update 300000
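For context, the distilled parallel data has to be binarized before this command. A minimal sketch of that step (the file prefixes under distill/ are assumptions; --joined-dictionary is used because --share-all-embeddings requires a shared source/target vocabulary):
fairseq-preprocess \
--source-lang en --target-lang de \
--trainpref distill/train --validpref distill/valid --testpref distill/test \
--destdir data-bin/wmt14_en_de_distill \
--joined-dictionary \
--workers 20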
I tested the performance of NAT with the following script:
python generate.py \
data-bin/wmt14_en_de/ \
--gen-subset test \
--task translation_lev \
--path ../checkpoints/Non-autoregressive_Transformer/checkpoint_best.pt \
--iter-decode-max-iter 0 \
--beam 1 --remove-bpe \
--print-step \
--batch-size 400
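As a sketch of how the reported BLEU can be cross-checked, assuming the output of the command above was redirected to a file gen.out, the standard H-/T- lines in fairseq's generation output can be rescored with fairseq-score:
grep ^H gen.out | cut -f3- > gen.out.sys
grep ^T gen.out | cut -f2- > gen.out.ref
fairseq-score --sys gen.out.sys --ref gen.out.ref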
In addition, I also tried incorporating source embedding copying, positional attention, and layer-wise attention. However, BLEU on the test set (newstest2014) is about 7, while it is about 18 on the development set (newstest2013).
I have the following two questions about training NAT:
- Why is there such a large gap between Fairseq's implementation (about 7 BLEU) and the vanilla NAT paper (17.35 BLEU)? Fairseq's implementation should obtain better performance, given the larger batch size and the better distilled dataset generated by a Transformer that reaches 27.2 BLEU, compared with the vanilla NAT setup.
- The trained NAT performs reasonably on the development set, so why does it drop by about 11 BLEU on the test set?
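One routine check for a dev/test gap of this kind: checkpoint_best.pt is selected on validation loss, so it can be worth also scoring an averaged checkpoint before drawing conclusions. A sketch, assuming fairseq's stock scripts/average_checkpoints.py and the --save-interval-updates 10000 setting above:
python scripts/average_checkpoints.py \
--inputs checkpoints \
--num-update-checkpoints 5 \
--output checkpoints/checkpoint_avg.pt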
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Hi,
So sorry for replying late.
@MultiPath
Confirmed. I got a very similar result after disabling left-pad (removing left-pad plus source embedding copy vs. without source embedding copy). However, it is a little different from your paper (especially the first one). I think it would help a lot of people if you made this clearer in the instructions.
https://github.com/pytorch/fairseq/blob/aa5f0119a383e013e56ae5d88e4a7aff0e67f0f9/examples/nonautoregressive_translation/scripts.md
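For anyone reproducing this, a minimal sketch of the adjustment described in the comment above, i.e. disabling source-side left padding and copying source embeddings into the decoder. The --left-pad-source and --src-embedding-copy flag names assume a recent fairseq release; all other arguments stay as in the training command shown earlier:
fairseq-train \
data-bin/wmt14_en_de_distill \
--left-pad-source False \
--src-embedding-copy \
--task translation_lev \
--criterion nat_loss \
--arch nonautoregressive_transformer \
--noise full_mask
# ...plus the optimizer, learning-rate, and remaining options from the training command above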