
Validation BLEU differences between fairseq-train and fairseq-generate

See original GitHub issue

❓ Questions and Help

When running fairseq-generate on my validation dataset with the best checkpoint, I’m seeing significantly different numbers from the best_bleu validation metric reported during fairseq-train. I traced the difference down a bit and noticed a few settings that differ between training and generation:

  1. During training, special tokens are handled as in https://github.com/pytorch/fairseq/blob/5ac5e8a20a7a914698f9970c2a384f14015ece3d/fairseq/tasks/translation.py#L453-L462, but not during generation, e.g. https://github.com/pytorch/fairseq/blob/5ac5e8a20a7a914698f9970c2a384f14015ece3d/fairseq_cli/generate.py#L242-L248. Why? (There seem to be some other differences between the two as well, but I think this is the most prominent one.)
  2. The examples provide fairseq-generate commands that, unlike fairseq-train, (a) do not specify a detokenizer (I think --tokenizer moses should be the generation-time counterpart of --eval-bleu-detok moses, correct? It is absent from the examples) and (b) do not set --max-len-{a,b}. Should we supply these flags during generation (or drop them during training) to keep the two consistent? A rough sketch of both commands follows this list.
  3. Are there any other factors that I’m missing that could contribute to this discrepancy?
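
For reference, here is a sketch of the two commands being compared. The data directory, language pair, architecture, and checkpoint path are illustrative placeholders (borrowed from the fairseq IWSLT'14 translation example), not my exact setup:

  # training: validation BLEU is computed by --eval-bleu
  # (model, optimizer, and criterion flags omitted for brevity)
  fairseq-train data-bin/iwslt14.tokenized.de-en \
      --arch transformer_iwslt_de_en \
      --eval-bleu \
      --eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' \
      --eval-bleu-detok moses \
      --eval-bleu-remove-bpe \
      --best-checkpoint-metric bleu \
      --maximize-best-checkpoint-metric

  # generation as given in the examples: no detokenizer, no max-len-{a,b}, default scorer
  fairseq-generate data-bin/iwslt14.tokenized.de-en \
      --path checkpoints/checkpoint_best.pt \
      --gen-subset valid \
      --batch-size 128 --beam 5 --remove-bpe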

Thanks!

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 3
  • Comments: 5

Top GitHub Comments

7 reactions
George0828Zhang commented, Jul 7, 2021

Hi, I had the same issue, but it was solved after adding this flag: --scoring sacrebleu. This is likely because --eval-bleu uses sacrebleu while fairseq-generate uses bleu by default (the gap was as large as 32.16 vs. 26.51 in my case!). Regarding your questions in 2, --tokenizer moses -s ${SRC} -t ${TGT} should give you the detokenizer. You can add generation args like --beam 5 --max-len-a 1.2 --max-len-b 10 --lenpen 1.1 to match your --eval-bleu settings.
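
Concretely, a fairseq-generate invocation that mirrors the --eval-bleu settings would look roughly like the sketch below. The data directory, checkpoint path, and language pair are placeholders, and the beam/length/lenpen values should match whatever you passed in --eval-bleu-args during training:

  # score the validation split with settings aligned to training-time --eval-bleu:
  #   --remove-bpe        mirrors --eval-bleu-remove-bpe
  #   --tokenizer moses   mirrors --eval-bleu-detok moses (-s/-t tell it the language pair)
  #   --scoring sacrebleu uses the same scorer as --eval-bleu
  fairseq-generate data-bin/iwslt14.tokenized.de-en \
      --path checkpoints/checkpoint_best.pt \
      --gen-subset valid \
      -s de -t en \
      --beam 5 --max-len-a 1.2 --max-len-b 10 --lenpen 1.1 \
      --remove-bpe \
      --tokenizer moses \
      --scoring sacrebleu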

0 reactions
yc1999 commented, Dec 1, 2022

Great, thanks. After adding the flags --scoring sacrebleu and --tokenizer moses, the mismatch between fairseq-train and fairseq-generate was resolved.

Read more comments on GitHub >

Top Results From Across the Web

Validation BLEU differences between fairseq-train and ...
Questions and Help: When running fairseq-generate on my validation dataset using the best checkpoint, I'm seeing quite significantly ...

Command-line Tools — fairseq 0.12.2 documentation
Fairseq provides several command-line tools for training and evaluating models: ... raw text with a trained model; fairseq-score: BLEU scoring of generated ...

Lesser BLEU score compared to Fairseq - Community
Hello, I am trying to do translation from English-Spanish with 35 million sentences. I created a vocabulary of 32K using sentencepiece. Fairseq ...

fairseq/examples/translation/README.md
In this example we'll train a multilingual {de,fr}-en translation model using the IWSLT'17 datasets. Note that we use slightly different ...

fairseq: A Fast, Extensible Toolkit for Sequence Modeling
During inference we build batches with a variable number of examples up to a user-specified number of tokens, similar to training. fairseq also ...
