Nonautoregressive Transformer performs poorly
Thank you for publishing the code of NAT. I tried to reproduce the results of the vanilla NAT following the instructions. I trained NAT on the distilled WMT14 En-De dataset provided by Fairseq with the following script:
fairseq-train \
data-bin/wmt14_en_de_distill \
--save-dir checkpoints \
--ddp-backend=no_c10d \
--task translation_lev \
--criterion nat_loss \
--arch nonautoregressive_transformer \
--noise full_mask \
--share-all-embeddings \
--optimizer adam --adam-betas '(0.9,0.98)' \
--lr 0.0005 --lr-scheduler inverse_sqrt \
--min-lr '1e-09' --warmup-updates 10000 \
--warmup-init-lr '1e-07' --label-smoothing 0.1 \
--dropout 0.3 --weight-decay 0.01 \
--decoder-learned-pos \
--encoder-learned-pos \
--pred-length-offset \
--length-loss-factor 0.1 \
--apply-bert-init \
--log-format 'simple' --log-interval 100 \
--fixed-validation-seed 7 \
--max-tokens 8000 \
--save-interval-updates 10000 \
--max-update 300000
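For context, the distilled parallel data has to be binarized before this command. A minimal sketch of that step (the file prefixes under distill/ are assumptions; --joined-dictionary is used because --share-all-embeddings requires a shared source/target vocabulary):
fairseq-preprocess \
--source-lang en --target-lang de \
--trainpref distill/train --validpref distill/valid --testpref distill/test \
--destdir data-bin/wmt14_en_de_distill \
--joined-dictionary \
--workers 20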
I tested the performance of NAT with the following script:
python generate.py \
data-bin/wmt14_en_de/ \
--gen-subset test \
--task translation_lev \
--path ../checkpoints/Non-autoregressive_Transformer/checkpoint_best.pt \
--iter-decode-max-iter 0 \
--beam 1 --remove-bpe \
--print-step \
--batch-size 400
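As a sketch of how the reported BLEU can be cross-checked, assuming the output of the command above was redirected to a file gen.out, the standard H-/T- lines in fairseq's generation output can be rescored with fairseq-score:
grep ^H gen.out | cut -f3- > gen.out.sys
grep ^T gen.out | cut -f2- > gen.out.ref
fairseq-score --sys gen.out.sys --ref gen.out.ref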
In addition, I also tried incorporating source embedding copying, positional attention, and layer-wise attention. However, BLEU on the test set (newstest2014) is about 7, while it is about 18 on the development set (newstest2013).
I have the following two questions about training NAT:
- Why is there such a large gap between Fairseq's implementation (about 7 BLEU) and the vanilla NAT paper (17.35 BLEU)? Fairseq's implementation should obtain better performance, given the larger batch size and the better distilled dataset generated by a Transformer that reaches 27.2 BLEU, compared with the vanilla NAT setup.
- The trained NAT performs reasonably on the development set, so why does it drop by about 11 BLEU on the test set?
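One routine check for a dev/test gap of this kind: checkpoint_best.pt is selected on validation loss, so it can be worth also scoring an averaged checkpoint before drawing conclusions. A sketch, assuming fairseq's stock scripts/average_checkpoints.py and the --save-interval-updates 10000 setting above:
python scripts/average_checkpoints.py \
--inputs checkpoints \
--num-update-checkpoints 5 \
--output checkpoints/checkpoint_avg.pt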
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Hi,
So sorry for replying late.
@MultiPath
Confirmed. I got a very similar result after disabling left-pad (removing left-pad plus source embedding copy vs. without source embedding copy). However, it is a little different from your paper (especially the first one). I think it would help a lot of people if you made this clearer in the instructions.
https://github.com/pytorch/fairseq/blob/aa5f0119a383e013e56ae5d88e4a7aff0e67f0f9/examples/nonautoregressive_translation/scripts.md
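For anyone reproducing this, a minimal sketch of the adjustment described in the comment above, i.e. disabling source-side left padding and copying source embeddings into the decoder. The --left-pad-source and --src-embedding-copy flag names assume a recent fairseq release; all other arguments stay as in the training command shown earlier:
fairseq-train \
data-bin/wmt14_en_de_distill \
--left-pad-source False \
--src-embedding-copy \
--task translation_lev \
--criterion nat_loss \
--arch nonautoregressive_transformer \
--noise full_mask
# ...plus the optimizer, learning-rate, and remaining options from the training command above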