corpus_bleu does not match multi-bleu.perl for very poor translations
Example:
tokenized reference:
Their tasks include changing a pump on the faulty stokehold .
Likewise , two species that are very similar in morphology were distinguished using genetics .
tokenized hypothesis:
Teo S yb , oe uNb , R , T t , , t
Tue Ar saln S , , 5istsi l , 5oe R ulO sae oR R
With the multi-bleu.perl script the BLEU score is 0:
BLEU = 0.00, 3.4/0.0/0.0/0.0 (BP=1.000, ratio=1.115, hyp_len=29, ref_len=26)
With corpus_bleu the BLEU score is 43.092382. My understanding is that the default corpus_bleu settings correspond to the multi-bleu.perl script.
from nltk.translate.bleu_score import corpus_bleu

references = [
    'Their tasks include changing a pump on the faulty stokehold .',
    'Likewise , two species that are very similar in morphology were distinguished using genetics .'
]
hypothesis = [
    'Teo S yb , oe uNb , R , T t , , t',
    'Tue Ar saln S , , 5istsi l , 5oe R ulO sae oR R'
]

hypothesis_tokens = [line.split(' ') for line in hypothesis]
references_tokens = [[line.split(' ')] for line in references]

# calculate the corpus-level BLEU score
bleu = corpus_bleu(references_tokens, hypothesis_tokens)
print('BLEU: %f' % (bleu * 100))  # 43.092382
@AndreasMadsen Actually, I've worked through the math by hand, and the 17 BLEU for the example you've given is really similar to the "the the the the the the the" example in the original BLEU paper; see also https://gist.github.com/alvations/e5922afa8c91472d25c58b2d712a93e7 . In code:
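(The snippet from the original comment isn't reproduced in this mirror. As a stand-in, here is a minimal sketch of the BLEU paper's modified n-gram precision, i.e. clipped n-gram matches over candidate n-grams, applied to the example at the top of this issue and reusing references_tokens / hypothesis_tokens from the snippet above. The helper names are illustrative, not NLTK API.)

from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def corpus_modified_precision(list_of_references, hypotheses, n):
    matches, total = 0, 0
    for refs, hyp in zip(list_of_references, hypotheses):
        hyp_counts = Counter(ngrams(hyp, n))
        # clip each hypothesis n-gram count at its maximum count in any reference
        max_ref_counts = Counter()
        for ref in refs:
            for gram, count in Counter(ngrams(ref, n)).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        matches += sum(min(count, max_ref_counts[gram])
                       for gram, count in hyp_counts.items())
        total += sum(hyp_counts.values())
    return matches, total

for n in range(1, 5):
    m, t = corpus_modified_precision(references_tokens, hypothesis_tokens, n)
    print('p_%d = %d / %d' % (n, m, t))
# For the example above this should give p_1 = 1 / 29 (the only credited
# unigram is a comma, i.e. the 3.4 reported by multi-bleu.perl) and zero
# matches for the 2-, 3- and 4-gram orders.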
The 17 BLEU (smoothed) or 34 BLEU (unsmoothed, taking only unigrams into account) comes from the modified precision being ridiculously high because of the punctuation. One quick hack to check for bad sentences is to remove the punctuation before running BLEU, e.g.:
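(Again, this is a sketch rather than the snippet from the original comment: drop punctuation tokens before scoring so that commas and full stops can no longer be credited as matches. strip_punct is an illustrative helper.)

import string
from nltk.translate.bleu_score import corpus_bleu

def strip_punct(tokens):
    # keep only tokens that are not single ASCII punctuation marks such as ',' or '.'
    return [tok for tok in tokens if tok not in string.punctuation]

refs_nopunct = [[strip_punct(ref) for ref in refs] for refs in references_tokens]
hyps_nopunct = [strip_punct(hyp) for hyp in hypothesis_tokens]

bleu = corpus_bleu(refs_nopunct, hyps_nopunct)
print('BLEU without punctuation: %f' % (bleu * 100))
# With the punctuation stripped there is no n-gram overlap left at all in this
# example; recent NLTK versions return 0 in that case.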
The implementation of the basic BLEU score in NLTK isn't exactly flawed, but it was never meant to emulate the effects of multi-bleu.pl, because multi-bleu.pl put in a hack that wasn't considered in the paper, i.e. it returns a 0.0 score if any of the 1- to 4-gram precisions is 0.0. To emulate that behavior, I've added a similar "hack": when the emulate_multibleu parameter is set to True, the final BLEU value is clipped at 0, i.e.:
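(Sketch only, assuming an NLTK version that exposes the emulate_multibleu parameter on corpus_bleu; the flag may be absent or behave differently in other releases.)

from nltk.translate.bleu_score import corpus_bleu

bleu = corpus_bleu(references_tokens, hypothesis_tokens, emulate_multibleu=True)
print('BLEU: %f' % (bleu * 100))
# Per the comment above, with the flag set the zero higher-order precisions
# clip the final score to 0, matching multi-bleu.perl's BLEU = 0.00.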
@alvations I see, thanks for looking into it.