Tokenized BLEU considered harmful - Discussion on community-based process
See original GitHub issue

https://github.com/huggingface/nlp/blob/7d1526dfeeb29248d832f1073192dbf03ad642da/metrics/bleu/bleu.py#L76 assumes the inputs are tokenized by the user. This is bad practice, because the user's tokenizer is usually not the same as the one used by mteval-v13a.pl, the closest thing we have to a standard. Moreover, tokenizers are like window managers: they can be endlessly customized, and nobody has quite the same options.
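To see why this matters, here is a minimal sketch (the tokenizers and sentence are illustrative, not the real mteval-v13a.pl rules): two plausible tokenizers produce different token sequences for the same string, so the n-gram counts that BLEU is computed from differ, and the resulting scores are not comparable.

```python
import re
from collections import Counter

def naive_tokenize(text):
    # A typical user tokenizer: whitespace split, punctuation stays attached.
    return text.split()

def v13a_like_tokenize(text):
    # Rough sketch in the spirit of mteval-v13a.pl (NOT the real script):
    # pad common punctuation with spaces, then split on whitespace.
    text = re.sub(r'([\.,!?:;\)\(])', r' \1 ', text)
    return text.split()

def bigrams(tokens):
    # Bigram counts, the raw material of BLEU's modified n-gram precision.
    return Counter(zip(tokens, tokens[1:]))

hyp = "The cat sat on the mat."
print(naive_tokenize(hyp))      # ['The', 'cat', 'sat', 'on', 'the', 'mat.']
print(v13a_like_tokenize(hyp))  # ['The', 'cat', 'sat', 'on', 'the', 'mat', '.']
```

The two tokenizations disagree on the final tokens, so their bigram multisets differ; any BLEU score computed on one tokenization cannot be compared to a score computed on the other.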
As @mjpost reported in https://www.aclweb.org/anthology/W18-6319.pdf, BLEU scores can vary by as much as 1.8 points depending on configuration alone. Yet people are incorrectly putting non-comparable BLEU scores in the same table, such as Table 1 in https://arxiv.org/abs/2004.04902 .
There are a few use cases where tokenized BLEU makes sense, such as Thai, where whitespace does not delimit words. For Chinese, people seem to use character-level BLEU, for better or worse.
The default, easy option should be the one that's correct more often, and that is sacrebleu. Please don't make it easy for people to run what is usually the wrong option; it definitely shouldn't be bleu.
Also, I know this is inherited from TensorFlow, and (paging @lmthang) they should discourage it too.
Issue Analytics
- Created 3 years ago
- Reactions: 15
- Comments: 11 (1 by maintainers)
Top GitHub Comments
Yes, there are slides like that at WMT every year 😃 BLEU correlates with human judgment only at coarse levels, and it seems to be getting worse when people try to use it to do model selection among high-performing neural systems.
However, the point isn’t whether BLEU is a good metric, but whether your BLEU score can be compared to other BLEU scores. They only can be compared if you use the same reference tokenization (similar to how you can’t compare LM perplexities across different segmentations). sacrebleu was an attempt to get everyone to use WMT’s reference tokenization (meaning, your system has to first remove its own tokenization) so that you could just compare across papers. This also prevents scores from being gamed.
Use sacrebleu on detokenized output and raw unmodified references.