Validation BLEU differences between fairseq-train and fairseq-generate
❓ Questions and Help
When running `fairseq-generate` on my validation dataset using the best checkpoint, I'm seeing significantly different numbers from the `best_bleu` validation metric during `fairseq-train`. I traced this difference down a bit and noticed a few different settings between training and generation:
- During training, we use special tokens like https://github.com/pytorch/fairseq/blob/5ac5e8a20a7a914698f9970c2a384f14015ece3d/fairseq/tasks/translation.py#L453-L462, but not during evaluation, e.g. https://github.com/pytorch/fairseq/blob/5ac5e8a20a7a914698f9970c2a384f14015ece3d/fairseq_cli/generate.py#L242-L248. Why is that? (It seems there are some other differences between the two, but I think this is the most prominent one.)
- The examples provide `fairseq-generate` commands that, unlike during `fairseq-train`, (a) do not specify a detokenizer (I think `--tokenizer moses` should be the generation-time counterpart of `--eval-bleu-detok moses`, correct? It is absent from the examples.) and (b) do not provide `--max-len-{a,b}`. Should we supply these flags during generation (or remove them during training) to ensure consistency?
- Are there any other factors that I'm missing that could contribute to this discrepancy?
Thanks!
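For context, here is a minimal sketch of the two invocations the question compares, loosely modeled on fairseq's translation examples. The dataset path, architecture, and checkpoint path are placeholders, not the poster's actual setup:

```bash
# Training: validation BLEU comes from the translation task's --eval-bleu options.
# Note the moses detokenizer and the max_len settings passed via --eval-bleu-args.
fairseq-train data-bin/my-dataset \
    --arch transformer --task translation \
    --eval-bleu \
    --eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' \
    --eval-bleu-detok moses \
    --eval-bleu-remove-bpe \
    --best-checkpoint-metric bleu --maximize-best-checkpoint-metric
    # (optimizer, criterion, and other training flags omitted for brevity)

# Generation, as shown in the examples: no --tokenizer and no --max-len-{a,b},
# which is the asymmetry described in the question.
fairseq-generate data-bin/my-dataset \
    --path checkpoints/checkpoint_best.pt \
    --gen-subset valid \
    --beam 5 --remove-bpe
```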
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Hi, I had the same issue, but after adding the flag `--scoring sacrebleu` it was solved. This is likely because `--eval-bleu` uses sacrebleu while `fairseq-generate` uses bleu by default (the gap is as large as 32.16 vs. 26.51 in my case!). Regarding your question 2, `--tokenizer moses -s ${SRC} -t ${TGT}` should give you the detokenizer. You can add generation args like `--beam 5 --max-len-a 1.2 --max-len-b 10 --lenpen 1.1` to match your `--eval-bleu` settings.

Great, thanks. After adding the flags `--scoring sacrebleu` and `--tokenizer moses`, the mismatch between `fairseq-train` and `fairseq-generate` is resolved.
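Putting the two comments together, a `fairseq-generate` call that mirrors the training-time `--eval-bleu` settings might look like the sketch below; the data path, checkpoint path, and the `${SRC}`/`${TGT}` language codes are placeholders:

```bash
fairseq-generate data-bin/my-dataset \
    -s ${SRC} -t ${TGT} \
    --path checkpoints/checkpoint_best.pt \
    --gen-subset valid \
    --beam 5 --max-len-a 1.2 --max-len-b 10 --lenpen 1.1 \
    --remove-bpe \
    --tokenizer moses \
    --scoring sacrebleu
```

Here `--gen-subset valid` scores the same validation split that `--eval-bleu` evaluates during training, and `--scoring sacrebleu` matches the sacrebleu-based training metric, as suggested in the first comment.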