
Seq2Seq Metrics QOL: Bleu, Rouge

See original GitHub issue

Putting all my QOL issues here; I don't think I will have time to propose fixes, but I didn't want these to be lost in case they are useful. I tried using rouge and bleu for the first time and wrote down everything I didn't immediately understand:

What I tried

Rouge experience:


from datasets import load_metric

rouge = load_metric('rouge')
rouge.add_batch(['hi im sam'], ['im daniel'])  # fails (positional arguments are rejected)
rouge.add_batch(predictions=['hi im sam'], references=['im daniel'])  # works
rouge.compute()  # huge, messy output, but reasonable. Not worth integrating because I don't want to rewrite all the postprocessing.
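
For context, a minimal sketch of how to pull a single scalar out of that output (assuming, as I believe is the case for the datasets rouge metric, that compute() returns a dict mapping each rouge type to an AggregateScore with low/mid/high Score tuples):

from datasets import load_metric

rouge = load_metric('rouge')
results = rouge.compute(predictions=['hi im sam'], references=['im daniel'])
# Each entry, e.g. results['rouge1'], is an AggregateScore whose low/mid/high
# fields are Score tuples carrying precision, recall, and fmeasure.
print(results['rouge1'].mid.fmeasure)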

BLEU experience:

from datasets import load_metric

bleu = load_metric('bleu')
bleu.add_batch(predictions=['hi im sam'], references=['im daniel'])      # fails
bleu.add_batch(predictions=[['hi im sam']], references=[['im daniel']])  # also fails

Both of these raise ValueError: Got a string but expected a list instead: 'im daniel'
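
The error is bleu insisting on pre-tokenized input: each prediction must be a list of tokens, and references must be nested one level deeper to allow multiple references per sample (lhoestq's reply below spells out the exact format). A minimal sketch of a call that goes through, assuming whitespace tokenization is acceptable for illustration:

from datasets import load_metric

bleu = load_metric('bleu')
bleu.add_batch(
    predictions=['hi im sam'.split()],    # one tokenized prediction
    references=[['im daniel'.split()]],   # one sample with one tokenized reference
)
print(bleu.compute())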

Doc Typo

This says dataset = load_metric(...), which seems wrong and will cause a NameError.

[Screenshot of the documentation snippet showing dataset = load_metric(...)]
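
Presumably the snippet was meant to bind the metric to a name matching how it is used afterwards, something like this (a guess at the intended fix, not the actual docs text):

from datasets import load_metric

metric = load_metric('rouge')  # not: dataset = load_metric('rouge')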

cc @lhoestq, feel free to ignore.

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 7
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

lhoestq commented, Jan 28, 2021 (8 reactions)

Hi !

As described in the documentation for bleu:

Args:
    predictions: list of translations to score.
        Each translation should be tokenized into a list of tokens.
    references: list of lists of references for each translation.
        Each reference should be tokenized into a list of tokens.

Therefore you can use this metric as follows:

from datasets import load_metric

predictions = [
    ["hello", "there", "general", "kenobi"],                             # tokenized prediction of the first sample
    ["foo", "bar", "foobar"]                                             # tokenized prediction of the second sample
]
references = [
    [["hello", "there", "general", "kenobi"], ["hello", "there", "!"]],  # tokenized references for the first sample (2 references)
    [["foo", "bar", "foobar"]]                                           # tokenized references for the second sample (1 reference)
]

bleu = load_metric("bleu")
bleu.compute(predictions=predictions, references=references)
# Or you can also add batches before calling compute()
# bleu.add_batch(predictions=predictions, references=references)
# bleu.compute()

Hope this helps 😃
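
As a quick usage note on the example above: if you start from raw strings, a whitespace split is the simplest way to get them into the shape bleu expects (a sketch; a real pipeline would want a proper tokenizer, and the result keys listed in the comment are what I believe the metric returns):

from datasets import load_metric

raw_predictions = ['hello there general kenobi', 'foo bar foobar']
raw_references = [['hello there general kenobi', 'hello there !'], ['foo bar foobar']]

predictions = [p.split() for p in raw_predictions]             # tokenize each prediction
references = [[r.split() for r in refs] for refs in raw_references]  # tokenize each reference

bleu = load_metric('bleu')
print(bleu.compute(predictions=predictions, references=references))
# The result should include keys like 'bleu', 'precisions', 'brevity_penalty',
# 'length_ratio', 'translation_length', and 'reference_length'.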

mrm8488 commented, Nov 12, 2020 (5 reactions)

So what is the right way to add a batch to compute BLEU?
