
Tokenized BLEU considered harmful - Discussion on community-based process

See original GitHub issue

https://github.com/huggingface/nlp/blob/7d1526dfeeb29248d832f1073192dbf03ad642da/metrics/bleu/bleu.py#L76 assumes the inputs are tokenized by the user. This is bad practice because the user’s tokenizer is usually not the same as the one used by mteval-v13a.pl, the closest thing we have to a standard. Moreover, tokenizers are like window managers: they can be endlessly customized and nobody has quite the same options.

As @mjpost reported in https://www.aclweb.org/anthology/W18-6319.pdf, BLEU scores computed under different tokenization and reference-processing configurations can differ by as much as 1.8 points. Yet people incorrectly put non-comparable BLEU scores in the same table, such as Table 1 in https://arxiv.org/abs/2004.04902.
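To make the non-comparability concrete, here is a minimal sketch (the sentences are invented, and it assumes a sacrebleu version whose corpus_bleu accepts a tokenize argument, as recent releases do): the same hypothesis/reference pair gets a different BLEU under each tokenizer.

    import sacrebleu

    # Toy hypothesis and reference, invented purely for illustration.
    hyps = ["The cat sat on the mat."]
    refs = [["There is a cat on the mat."]]  # one reference stream

    # Scoring the identical pair under different tokenizers yields different
    # numbers, which is exactly why such scores cannot share a table.
    for tok in ("13a", "intl", "char"):
        result = sacrebleu.corpus_bleu(hyps, refs, tokenize=tok)
        print(f"tokenize={tok!r}: BLEU = {result.score:.2f}")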

There are a few legitimate use cases for tokenized BLEU, such as languages without whitespace-delimited words like Thai. For Chinese, people seem to use character-level BLEU, for better or worse.

The default easy option should be the one that’s correct more often. And that is sacrebleu. Please don’t make it easy for people to run what is usually the wrong option; it definitely shouldn’t be bleu.

Also, I know this is inherited from TensorFlow and, paging @lmthang, they should discourage it too.

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Reactions: 15
  • Comments: 11 (1 by maintainers)

Top GitHub Comments

7 reactions
mjpost commented, May 16, 2020

Yes, there are slides like that at WMT every year 😃. BLEU correlates with human judgment only at coarse levels, and it seems to be getting worse when people try to use it for model selection among high-performing neural systems.

However, the point isn’t whether BLEU is a good metric, but whether your BLEU score can be compared to other BLEU scores. They can only be compared if you use the same reference tokenization (similar to how you can’t compare LM perplexities across different segmentations). sacrebleu was an attempt to get everyone to use WMT’s reference tokenization (meaning your system has to first remove its own tokenization) so that you can compare across papers. This also prevents scores from being gamed.
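What the comment describes is visible in sacrebleu’s own API; this is a hedged sketch assuming sacrebleu 2.x and its BLEU class, with made-up sentences. You pass detokenized output and raw references, the WMT reference tokenization (13a by default) is applied internally, and the metric hands back a signature string to report alongside the score.

    from sacrebleu.metrics import BLEU

    # Detokenized system output and raw, untouched references (toy data).
    hyps = ["The cat sat on the mat."]
    refs = [["There is a cat on the mat."]]

    bleu = BLEU()  # defaults to the 13a reference tokenization
    result = bleu.corpus_score(hyps, refs)

    print(result.score)          # the BLEU value itself
    print(bleu.get_signature())  # e.g. nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:...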

1 reaction
kpu commented, Jan 7, 2021

Use sacrebleu on detokenized output and raw unmodified references.
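A minimal sketch of that advice inside an evaluation script (the detokenizer here is sacremoses’ MosesDetokenizer, which is my assumption; any detokenizer that undoes your system’s own tokenization would serve):

    import sacrebleu
    from sacremoses import MosesDetokenizer

    detok = MosesDetokenizer(lang="en")

    # Hypothetical tokenized system outputs; the references stay raw and unmodified.
    tokenized_outputs = [["The", "cat", "sat", "on", "the", "mat", "."]]
    raw_references = [["There is a cat on the mat."]]

    # Undo the system's own tokenization, then let sacrebleu apply its standard one.
    detokenized = [detok.detokenize(tokens) for tokens in tokenized_outputs]
    print(sacrebleu.corpus_bleu(detokenized, raw_references).score)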


