Validation BLEU differences between fairseq-train and fairseq-generate
❓ Questions and Help
When running `fairseq-generate` on my validation dataset using the best checkpoint, I'm seeing significantly different numbers from the `best_bleu` validation metric during `fairseq-train`. I traced this difference down a bit and noticed a few different settings between training and generation:
- During training, we use special tokens like https://github.com/pytorch/fairseq/blob/5ac5e8a20a7a914698f9970c2a384f14015ece3d/fairseq/tasks/translation.py#L453-L462, but not during evaluation, e.g. https://github.com/pytorch/fairseq/blob/5ac5e8a20a7a914698f9970c2a384f14015ece3d/fairseq_cli/generate.py#L242-L248. Why is that? (It seems there are some other differences between the two, but I think this is the most prominent one.)
- The examples provide `fairseq-generate` commands that, unlike during `fairseq-train`, (a) do not specify a detokenizer (I think `--tokenizer moses` should be the generation-time counterpart of `--eval-bleu-detok moses`, correct? It is absent from the examples.) and (b) do not provide `--max-len-{a,b}`. Should we supply these flags during generation (or remove them during training) to ensure consistency?
- Are there any other factors that I'm missing that could contribute to this discrepancy?
Thanks!
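For context, here is a minimal sketch of the two invocations the question compares, loosely modeled on fairseq's translation examples. The dataset path, architecture, and checkpoint path are placeholders, not the poster's actual setup:

```bash
# Training: validation BLEU comes from the translation task's --eval-bleu options.
# Note the moses detokenizer and the max_len settings passed via --eval-bleu-args.
fairseq-train data-bin/my-dataset \
    --arch transformer --task translation \
    --eval-bleu \
    --eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' \
    --eval-bleu-detok moses \
    --eval-bleu-remove-bpe \
    --best-checkpoint-metric bleu --maximize-best-checkpoint-metric
    # (optimizer, criterion, and other training flags omitted for brevity)

# Generation, as shown in the examples: no --tokenizer and no --max-len-{a,b},
# which is the asymmetry described in the question.
fairseq-generate data-bin/my-dataset \
    --path checkpoints/checkpoint_best.pt \
    --gen-subset valid \
    --beam 5 --remove-bpe
```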
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Hi, I had the same issue, but after adding the flag `--scoring sacrebleu` it was solved. This is likely because `--eval-bleu` uses sacrebleu while `fairseq-generate` uses bleu by default (the gap is as large as 32.16 vs. 26.51 in my case!). Regarding your question 2, `--tokenizer moses -s ${SRC} -t ${TGT}` should give you the detokenizer. You can add generation args like `--beam 5 --max-len-a 1.2 --max-len-b 10 --lenpen 1.1` to match your `--eval-bleu` settings.

Great, thanks. After adding the flags `--scoring sacrebleu` and `--tokenizer moses`, the mismatch between `fairseq-train` and `fairseq-generate` is resolved.
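Putting the two comments together, a `fairseq-generate` call that mirrors the training-time `--eval-bleu` settings might look like the sketch below; the data path, checkpoint path, and the `${SRC}`/`${TGT}` language codes are placeholders:

```bash
fairseq-generate data-bin/my-dataset \
    -s ${SRC} -t ${TGT} \
    --path checkpoints/checkpoint_best.pt \
    --gen-subset valid \
    --beam 5 --max-len-a 1.2 --max-len-b 10 --lenpen 1.1 \
    --remove-bpe \
    --tokenizer moses \
    --scoring sacrebleu
```

Here `--gen-subset valid` scores the same validation split that `--eval-bleu` evaluates during training, and `--scoring sacrebleu` matches the sacrebleu-based training metric, as suggested in the first comment.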