
Nonautoregressive-Transformer performs poorly


Thank you for publishing the code of NAT. I tried to reproduce the results of vanilla NAT following the instructions. I trained NAT on the distilled WMT14 En-De dataset provided by Fairseq with the following script:

fairseq-train \
    data-bin/wmt14_en_de_distill \
    --save-dir checkpoints \
    --ddp-backend=no_c10d \
    --task translation_lev \
    --criterion nat_loss \
    --arch nonautoregressive_transformer \
    --noise full_mask \
    --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9,0.98)'  \
    --lr 0.0005 --lr-scheduler inverse_sqrt \
    --min-lr '1e-09' --warmup-updates 10000 \
    --warmup-init-lr '1e-07' --label-smoothing 0.1 \
    --dropout 0.3 --weight-decay 0.01 \
    --decoder-learned-pos \
    --encoder-learned-pos \
    --pred-length-offset \
    --length-loss-factor 0.1 \
    --apply-bert-init \
    --log-format 'simple' --log-interval 100 \
    --fixed-validation-seed 7 \
    --max-tokens 8000 \
    --save-interval-updates 10000 \
    --max-update 300000
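
For reference, in case it is useful to anyone reproducing this setup: the data-bin/wmt14_en_de_distill directory can be built from the released distilled text files with fairseq-preprocess roughly as below. The distill/ split prefixes are placeholders for wherever the raw distilled data was extracted, and --joined-dictionary is used because --share-all-embeddings is passed at training time.

fairseq-preprocess \
    --source-lang en --target-lang de \
    --trainpref distill/train \
    --validpref distill/valid \
    --testpref distill/test \
    --joined-dictionary \
    --destdir data-bin/wmt14_en_de_distill \
    --workers 20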

I tested the performance of NAT with the following script:

python generate.py \
    data-bin/wmt14_en_de/ \
    --gen-subset test \
    --task translation_lev \
    --path ../checkpoints/Non-autoregressive_Transformer/checkpoint_best.pt \
    --iter-decode-max-iter 0 \
    --beam 1 --remove-bpe \
    --print-step \
    --batch-size 400
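
As a side note, to rule out scoring artifacts when comparing against the paper's numbers, one common way to score the output of generate.py is shown below. It assumes the output above is redirected to a file (here called gen.out, an assumed name) with the default T-/H- prefixed lines:

grep ^T gen.out | cut -f2- > gen.ref
grep ^H gen.out | cut -f3- > gen.sys
fairseq-score --sys gen.sys --ref gen.ref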

In addition, I also attempted to incorporate copying of source embeddings, positional attention, and layer-wise attention. However, BLEU on the test set (newstest2014) is about 7, while it is about 18 on the development set (newstest2013).

I have the following two questions about training NAT:

  1. Why is there such a large margin between Fairseq’s implementation (about 7 BLEU) and vanilla NAT (17.35 BLEU)? Fairseq’s implementation should obtain better performance, given the bigger batch size and the better distilled dataset (generated by a Transformer with 27.2 BLEU), compared with vanilla NAT.
  2. The trained NAT performs well on the development set, but why does it drop drastically, by 11 BLEU, on the test set?

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 16 (2 by maintainers)

Top GitHub Comments

3 reactions
MultiPath commented, Nov 4, 2019

Hi,

So sorry for replying late.

  1. The fairseq implementation does not use “fertility” as the latent variable in the middle, so it will get a lower BLEU score; the original paper relies on knowledge distillation (a rough sketch of that recipe follows this list). We also provided a distilled dataset; you can train on that, and it will give you around 18~19 BLEU.
  2. Yes, there is a big difference between the valid and test sets in terms of BLEU scores. It is a dataset-wise difference, and absolute BLEU scores are not really comparable.
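
(On the distillation point above: the usual sequence-level distillation recipe is to decode the training set with a trained autoregressive Transformer teacher and use its hypotheses as the new target side. A rough, illustrative sketch, assuming a teacher checkpoint at checkpoints_at/checkpoint_best.pt and the original binarized data, both assumed paths:)

mkdir -p distill
fairseq-generate \
    data-bin/wmt14_en_de \
    --gen-subset train \
    --path checkpoints_at/checkpoint_best.pt \
    --beam 5 \
    --max-tokens 8000 > train.at.out

grep ^S train.at.out | cut -f2- > distill/train.en
grep ^H train.at.out | cut -f3- > distill/train.de

The resulting distill/train.* files can then be binarized with fairseq-preprocess (as sketched earlier in this thread) and used for NAT training.
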
2 reactions
hzhwcmhf commented, Apr 4, 2021

@MultiPath

Confirmed. I got a very similar result after disabling left-pad (removing left-pad + source embedding copy vs. without source embedding copy). However, it is a little different from your paper (especially from the first one). I think it would help a lot of people if you made this clearer in the instructions.

https://github.com/pytorch/fairseq/blob/aa5f0119a383e013e56ae5d88e4a7aff0e67f0f9/examples/nonautoregressive_translation/scripts.md
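
For anyone who lands here with the same problem: recent fairseq versions expose a --left-pad-source option on the translation tasks (default True) and a --src-embedding-copy flag on the NAT model, so the combination described above can be tried roughly as below. This is only a sketch: most flags from the original training command are omitted for brevity and assumed unchanged, checkpoints_no_leftpad is an arbitrary save directory, and the same --left-pad-source False must also be passed to generate.py at test time.

fairseq-train \
    data-bin/wmt14_en_de_distill \
    --left-pad-source False \
    --src-embedding-copy \
    --task translation_lev \
    --criterion nat_loss \
    --arch nonautoregressive_transformer \
    --noise full_mask \
    --optimizer adam --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 10000 \
    --max-tokens 8000 --max-update 300000 \
    --save-dir checkpoints_no_leftpad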
