corpus_bleu does not match multi-bleu.perl for very poor translations
Example:
tokenized reference:
Their tasks include changing a pump on the faulty stokehold .
Likewise , two species that are very similar in morphology were distinguished using genetics .
tokenized hypothesis:
Teo S yb , oe uNb , R , T t , , t
Tue Ar saln S , , 5istsi l , 5oe R ulO sae oR R
With the multi-bleu.perl script the BLEU score is 0:
BLEU = 0.00, 3.4/0.0/0.0/0.0 (BP=1.000, ratio=1.115, hyp_len=29, ref_len=26)
With corpus_bleu the BLEU score is 43.092382. My understanding is that the default corpus_bleu settings correspond to the multi-bleu.perl script.
from nltk.translate.bleu_score import corpus_bleu

references = [
    'Their tasks include changing a pump on the faulty stokehold .',
    'Likewise , two species that are very similar in morphology were distinguished using genetics .'
]
hypothesis = [
    'Teo S yb , oe uNb , R , T t , , t',
    'Tue Ar saln S , , 5istsi l , 5oe R ulO sae oR R'
]

hypothesis_tokens = [line.split(' ') for line in hypothesis]
references_tokens = [[line.split(' ')] for line in references]

# calculate the corpus-level BLEU score
bleu = corpus_bleu(references_tokens, hypothesis_tokens)
print('BLEU: %f' % (bleu * 100))  # 43.092382
@AndreasMadsen Actually, I've worked through the math by hand, and the 17 BLEU for the example you've given is really similar to the "the the the the the the the" example in the original BLEU paper; see also https://gist.github.com/alvations/e5922afa8c91472d25c58b2d712a93e7 . In code:
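(The snippet from the original comment isn't reproduced in this mirror. As a stand-in, here is a minimal sketch of the BLEU paper's modified n-gram precision, i.e. clipped n-gram matches over candidate n-grams, applied to the example at the top of this issue and reusing references_tokens / hypothesis_tokens from the snippet above. The helper names are illustrative, not NLTK API.)

from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def corpus_modified_precision(list_of_references, hypotheses, n):
    matches, total = 0, 0
    for refs, hyp in zip(list_of_references, hypotheses):
        hyp_counts = Counter(ngrams(hyp, n))
        # clip each hypothesis n-gram count at its maximum count in any reference
        max_ref_counts = Counter()
        for ref in refs:
            for gram, count in Counter(ngrams(ref, n)).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        matches += sum(min(count, max_ref_counts[gram])
                       for gram, count in hyp_counts.items())
        total += sum(hyp_counts.values())
    return matches, total

for n in range(1, 5):
    m, t = corpus_modified_precision(references_tokens, hypothesis_tokens, n)
    print('p_%d = %d / %d' % (n, m, t))
# For the example above this should give p_1 = 1 / 29 (the only credited
# unigram is a comma, i.e. the 3.4 reported by multi-bleu.perl) and zero
# matches for the 2-, 3- and 4-gram orders.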
The 17 BLEU (smoothed) or 34 BLEU (unsmoothed, taking only unigrams into account) comes from the modified precision being ridiculously high because of the punctuation. One quick hack to check for bad sentences is to remove the punctuation before running BLEU, e.g.:
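(Again, this is a sketch rather than the snippet from the original comment: drop punctuation tokens before scoring so that commas and full stops can no longer be credited as matches. strip_punct is an illustrative helper.)

import string
from nltk.translate.bleu_score import corpus_bleu

def strip_punct(tokens):
    # keep only tokens that are not single ASCII punctuation marks such as ',' or '.'
    return [tok for tok in tokens if tok not in string.punctuation]

refs_nopunct = [[strip_punct(ref) for ref in refs] for refs in references_tokens]
hyps_nopunct = [strip_punct(hyp) for hyp in hypothesis_tokens]

bleu = corpus_bleu(refs_nopunct, hyps_nopunct)
print('BLEU without punctuation: %f' % (bleu * 100))
# With the punctuation stripped there is no n-gram overlap left at all in this
# example; recent NLTK versions return 0 in that case.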
The implementation of the basic BLEU score in NLTK isn't exactly flawed, but it was never meant to emulate the effects of multi-bleu.pl, because multi-bleu.pl put in a hack that wasn't considered in the paper, i.e. it returns a 0.0 score if any of the 1- to 4-gram precisions is 0.0. To emulate that behavior, I've added a similar "hack": when the emulate_multibleu parameter is set to True, the final BLEU value is clipped at 0, i.e.:
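(Sketch only, assuming an NLTK version that exposes the emulate_multibleu parameter on corpus_bleu; the flag may be absent or behave differently in other releases.)

from nltk.translate.bleu_score import corpus_bleu

bleu = corpus_bleu(references_tokens, hypothesis_tokens, emulate_multibleu=True)
print('BLEU: %f' % (bleu * 100))
# Per the comment above, with the flag set the zero higher-order precisions
# clip the final score to 0, matching multi-bleu.perl's BLEU = 0.00.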
@alvations I see, thanks for looking into it.