
corpus_bleu does not match multi-bleu.perl for very poor translations


example:

tokenized references:

Their tasks include changing a pump on the faulty stokehold .
Likewise , two species that are very similar in morphology were distinguished using genetics .

tokenized hypotheses:

Teo S yb , oe uNb , R , T t , , t
Tue Ar saln S , , 5istsi l , 5oe R ulO sae oR R

with the multi-bleu.perl script the BLEU score is 0:

BLEU = 0.00, 3.4/0.0/0.0/0.0 (BP=1.000, ratio=1.115, hyp_len=29, ref_len=26)
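For reference, the brevity penalty and ratio in that line follow directly from the reported token counts; a quick check (a small sketch, not part of the original report):

from math import exp

# hyp_len and ref_len as reported by multi-bleu.perl above
hyp_len, ref_len = 29, 26
ratio = hyp_len / ref_len
# the brevity penalty is 1 when the hypothesis is longer than the reference
bp = 1.0 if hyp_len > ref_len else exp(1 - ref_len / hyp_len)
print('BP=%.3f, ratio=%.3f' % (bp, ratio))  # BP=1.000, ratio=1.115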

with corpus_bleu the BLEU score is 43.092382; my understanding is that the default corpus_bleu settings correspond to the multi-bleu.perl script:


from nltk.translate.bleu_score import corpus_bleu

references = [
    'Their tasks include changing a pump on the faulty stokehold .',
    'Likewise , two species that are very similar in morphology were distinguished using genetics .'
]

hypothesis = [
    'Teo S yb , oe uNb , R , T t , , t',
    'Tue Ar saln S , , 5istsi l , 5oe R ulO sae oR R'
]

hypothesis_tokens = [line.split(' ') for line in hypothesis]
# corpus_bleu expects one *list* of references per hypothesis
references_tokens = [[line.split(' ')] for line in references]

# calculate corpus-level BLEU with default settings
bleu = corpus_bleu(
    references_tokens, hypothesis_tokens
)

print('BLEU: %f' % (bleu * 100))  # 43.092382

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Comments: 8 (5 by maintainers)

Top GitHub Comments

3 reactions
alvations commented, Mar 19, 2017

@AndreasMadsen actually I’ve worked through the math by hand, and the 17 BLEU for the example you’ve given is really similar to the “the the the the the the the” example in the original BLEU paper; see also https://gist.github.com/alvations/e5922afa8c91472d25c58b2d712a93e7 . In code:

>>> from nltk.translate import bleu
>>> from nltk.translate.bleu_score import modified_precision
>>> reference1 = 'the cat is on the mat'.split()
>>> reference2 = 'there is a cat on the mat'.split()
>>> hypothesis1 = 'the the the the the the the'.split()
>>> references = [reference1, reference2]
>>> float(modified_precision(references, hypothesis1, n=1)) # doctest: +ELLIPSIS
0.2857...
>>> bleu([ref1, ref2], hyp)  # ref1, ref2, hyp presumably defined as in the gist above
0.7311104457090247

The 17 BLEU (smoothed) or 34 BLEU (unsmoothed, taking only unigrams into account) comes from the modified precision being ridiculously high because of the punctuation. One quick hack to check for bad sentences is to remove the punctuation before running BLEU, e.g.:

>>> from nltk.translate.bleu_score import corpus_bleu
>>> from string import punctuation
>>> 
>>> hyp = "Teo S yb , oe uNb , R , T t , , t Tue Ar saln S , , 5istsi l , 5oe R ulO sae oR R"
>>> ref = "Their tasks include changing a pump on the faulty stokehold . Likewise , two species that are very similar in morphology were distinguished using genetics ."
>>> 
>>> hyp = ''.join([ch for ch in hyp if ch not in punctuation])
>>> ref = ''.join([ch for ch in ref if ch not in punctuation])
>>> corpus_bleu([[ref.split()]], [hyp.split()])
0
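To see where that inflated precision comes from, one can also inspect the per-order modified precisions of the punctuated strings directly (a rough sketch along the same lines, reusing modified_precision from above):

from nltk.translate.bleu_score import modified_precision

hyp = "Teo S yb , oe uNb , R , T t , , t Tue Ar saln S , , 5istsi l , 5oe R ulO sae oR R".split()
ref = "Their tasks include changing a pump on the faulty stokehold . Likewise , two species that are very similar in morphology were distinguished using genetics .".split()

# the commas are the only hypothesis tokens that also occur in the reference,
# so they account for all of the clipped n-gram matches
for n in range(1, 5):
    print('p_%d = %s' % (n, modified_precision([ref], hyp, n)))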

The implementation of the basic BLEU score in NLTK isn’t exactly flawed, but it was never meant to emulate the effects of multi-bleu.perl, because multi-bleu.perl puts in a hack that wasn’t considered in the paper, i.e. it returns a 0.0 score if any of the 1- to 4-gram precisions is 0.0.
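In other words, multi-bleu.perl combines the per-order precisions with a geometric mean and short-circuits to 0 as soon as any of them is 0. A minimal sketch of that behaviour, plugging in the 3.4/0.0/0.0/0.0 precisions reported above:

from math import exp, log

# per-order precisions from the multi-bleu.perl output: 3.4/0.0/0.0/0.0 (percent)
precisions = [0.034, 0.0, 0.0, 0.0]
bp = 1.0  # BP=1.000 in the same output

if any(p == 0.0 for p in precisions):
    # the multi-bleu.perl hack: any zero precision forces the whole score to 0
    score = 0.0
else:
    score = bp * exp(sum(0.25 * log(p) for p in precisions))

print('BLEU = %.2f' % (score * 100))  # BLEU = 0.00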

To emulate that behavior, I’ve added a similar “hack”:

  1. set any zero precision (a fraction/floating point) to the smallest positive float the system can represent, then
  2. taking the log of that value turns it into a very large negative number, and
  3. when the BLEU formula takes the exponential of the weighted log precisions before multiplying by the brevity penalty, the result collapses back down to a value around that tiny float, and
  4. the final rounding that kicks in when the emulate_multibleu parameter is set to True clips the final BLEU value to 0 (a short sketch putting these steps together follows the doctest below), i.e.:
>>> import sys
>>> from math import log, exp
>>> sys.float_info.min
2.2250738585072014e-308
>>> log(sys.float_info.min)
-708.3964185322641
>>> exp(log(sys.float_info.min))
2.2250738585072626e-308
>>> round(exp(log(sys.float_info.min)), 4)
0.0
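Putting those four steps together, a rough sketch of the emulation (not the actual NLTK source) looks like this:

import sys
from math import exp, log

precisions = [0.034, 0.0, 0.0, 0.0]  # same per-order precisions as above
bp = 1.0

# step 1: replace zero precisions with the smallest positive float
clipped = [p if p > 0 else sys.float_info.min for p in precisions]
# steps 2-3: sum the weighted logs, exponentiate, and apply the brevity penalty
score = bp * exp(sum(0.25 * log(p) for p in clipped))
# step 4: the rounding applied under emulate_multibleu clips the result to 0
print(round(score, 4))  # 0.0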
0 reactions
AndreasMadsen commented, Mar 19, 2017

@alvations I see, thanks for looking into it.
