
Tokenized BLEU considered harmful - Discussion on community-based process

See original GitHub issue

https://github.com/huggingface/nlp/blob/7d1526dfeeb29248d832f1073192dbf03ad642da/metrics/bleu/bleu.py#L76 assumes the inputs are tokenized by the user. This is bad practice because the user’s tokenizer is usually not the same as the one used by mteval-v13a.pl, the closest thing we have to a standard. Moreover, tokenizers are like window managers: they can be endlessly customized and nobody has quite the same options.

As @mjpost reported in https://www.aclweb.org/anthology/W18-6319.pdf, BLEU scores computed under different tokenization and reference-processing configurations can differ by as much as 1.8 points. Yet people incorrectly put non-comparable BLEU scores in the same table, such as Table 1 in https://arxiv.org/abs/2004.04902.
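To make the non-comparability concrete, here is a minimal sketch (the sentences are invented, and it assumes a sacrebleu version whose corpus_bleu accepts a tokenize argument, as recent releases do): the same hypothesis/reference pair gets a different BLEU under each tokenizer.

    import sacrebleu

    # Toy hypothesis and reference, invented purely for illustration.
    hyps = ["The cat sat on the mat."]
    refs = [["There is a cat on the mat."]]  # one reference stream

    # Scoring the identical pair under different tokenizers yields different
    # numbers, which is exactly why such scores cannot share a table.
    for tok in ("13a", "intl", "char"):
        result = sacrebleu.corpus_bleu(hyps, refs, tokenize=tok)
        print(f"tokenize={tok!r}: BLEU = {result.score:.2f}")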

There are a few legitimate use cases for tokenized BLEU, such as languages without whitespace-delimited words like Thai. For Chinese, people seem to use character-level BLEU, for better or worse.

The default easy option should be the one that’s correct more often. And that is sacrebleu. Please don’t make it easy for people to run what is usually the wrong option; it definitely shouldn’t be bleu.

Also, I know this is inherited from TensorFlow and, paging @lmthang, they should discourage it too.

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Reactions: 15
  • Comments: 11 (1 by maintainers)

Top GitHub Comments

7 reactions
mjpost commented, May 16, 2020

Yes, there are slides like that at WMT every year 😃. BLEU correlates with human judgment only at coarse levels, and it seems to be getting worse when people try to use it for model selection among high-performing neural systems.

However, the point isn’t whether BLEU is a good metric, but whether your BLEU score can be compared to other BLEU scores. They can only be compared if you use the same reference tokenization (similar to how you can’t compare LM perplexities across different segmentations). sacrebleu was an attempt to get everyone to use WMT’s reference tokenization (meaning your system has to first remove its own tokenization) so that you can compare across papers. This also prevents scores from being gamed.
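What the comment describes is visible in sacrebleu’s own API; this is a hedged sketch assuming sacrebleu 2.x and its BLEU class, with made-up sentences. You pass detokenized output and raw references, the WMT reference tokenization (13a by default) is applied internally, and the metric hands back a signature string to report alongside the score.

    from sacrebleu.metrics import BLEU

    # Detokenized system output and raw, untouched references (toy data).
    hyps = ["The cat sat on the mat."]
    refs = [["There is a cat on the mat."]]

    bleu = BLEU()  # defaults to the 13a reference tokenization
    result = bleu.corpus_score(hyps, refs)

    print(result.score)          # the BLEU value itself
    print(bleu.get_signature())  # e.g. nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:...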

1 reaction
kpu commented, Jan 7, 2021

Use sacrebleu on detokenized output and raw unmodified references.
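A minimal sketch of that advice inside an evaluation script (the detokenizer here is sacremoses’ MosesDetokenizer, which is my assumption; any detokenizer that undoes your system’s own tokenization would serve):

    import sacrebleu
    from sacremoses import MosesDetokenizer

    detok = MosesDetokenizer(lang="en")

    # Hypothetical tokenized system outputs; the references stay raw and unmodified.
    tokenized_outputs = [["The", "cat", "sat", "on", "the", "mat", "."]]
    raw_references = [["There is a cat on the mat."]]

    # Undo the system's own tokenization, then let sacrebleu apply its standard one.
    detokenized = [detok.detokenize(tokens) for tokens in tokenized_outputs]
    print(sacrebleu.corpus_bleu(detokenized, raw_references).score)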


