Using Bleu for batch input
The example provided in the docs for Bleu is for a single input. In that case, the output from the engine is in this format:
def evaluate_step():
    ...
    predictions = "Predicted Sentence 1".split()
    references = ["Reference Sentence 1".split(), "Reference Sentence 1.2".split()]
    return (predictions, references)
When calculating the Bleu score for a batch, what is the format of the output from the engine? It should be something like:
# For a batch of size 2
predictions = ["Predicted Sentence 1".split(), "Predicted Sentence 2".split()]
references = [
    ["Reference Sentence 1.1".split(), "Reference Sentence 1.2".split()],
    ["Reference Sentence 2.1".split(), "Reference Sentence 2.2".split()],
]
Doing this gives an error:

TypeError: unhashable type: 'list'

The typing for update requires predictions to be a Sequence[Any] and references to be a Sequence[Sequence[Any]]. Does this mean that batch input is not possible?
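For reference, one workaround that keeps the documented single-input format is to loop over the batch and call update once per sentence. This is only a sketch: it assumes ignite.metrics.Bleu with the single-input update signature shown above, and the constructor argument is illustrative.

from ignite.metrics import Bleu

bleu = Bleu(ngram=4)  # illustrative constructor argument

# Batch in the format proposed above
predictions = ["Predicted Sentence 1".split(), "Predicted Sentence 2".split()]
references = [
    ["Reference Sentence 1.1".split(), "Reference Sentence 1.2".split()],
    ["Reference Sentence 2.1".split(), "Reference Sentence 2.2".split()],
]

# One update per sentence, matching the single-input format
for pred, refs in zip(predictions, references):
    bleu.update((pred, refs))

score = bleu.compute()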
What you mentioned is correct only if the input is the whole corpus.

The corpus is split into batches, so we first have to accumulate the n-gram counters, and only compute the bp and mean once the whole corpus has been covered. This implies multiple update calls (accumulation) and a single compute (bp + mean). In other words, the micro average you suggested is not correct.

And if you consider distributed computing, each process has a part of the corpus. Again, we accumulate, (sync,) then bp + mean.
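In code terms, the accumulate-then-compute flow looks like this (a sketch; corpus_batches and the bleu metric object are hypothetical placeholders, not a specific ignite API):

# Hypothetical: one update per batch of the corpus, a single compute at the end
for batch_predictions, batch_references in corpus_batches:
    bleu.update((batch_predictions, batch_references))
corpus_score = bleu.compute()  # bp + geometric mean over the accumulated counters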
Firstly, the _corpus_bleu function should be split into 2 parts: (1) the accumulation of the n-gram counters (and maybe the lengths of cand/hyp too) and (2) the computation of bp, smoothing and geometric mean. The sentence bleu score (i.e. macro avg) is (1) + (2) for each sentence and then (3) the average. The corpus bleu score (i.e. micro avg) is (1) for each sentence, then (2).

For the batch version, the macro avg means applying the sentence score to each sentence of the batch. We must add a loop over the batch, which is quite straightforward. The micro avg is natively fine with the batch version (apply (1) on the batch, then (2)).
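To make the split concrete, here is a self-contained sketch of the two parts for plain (unsmoothed) BLEU. MicroBleu is a hypothetical name for illustration, not ignite's implementation.

import math
from collections import Counter

def ngrams(tokens, n):
    # All n-grams of a token list, as a Counter
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

class MicroBleu:
    # Corpus (micro avg) BLEU: accumulate counters over batches with (1),
    # compute brevity penalty + geometric mean once at the end with (2).
    def __init__(self, max_order=4):
        self.max_order = max_order
        self.matched = [0] * max_order  # (1) clipped n-gram matches per order
        self.total = [0] * max_order    # (1) candidate n-gram totals per order
        self.cand_len = 0
        self.ref_len = 0

    def update(self, predictions, references):
        # Part (1): accumulate counters and lengths over a batch
        for cand, refs in zip(predictions, references):
            self.cand_len += len(cand)
            # closest reference length, ties broken by the shorter one
            self.ref_len += min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
            for n in range(1, self.max_order + 1):
                cand_counts = ngrams(cand, n)
                max_ref = Counter()
                for r in refs:
                    max_ref |= ngrams(r, n)      # elementwise max over references
                clipped = cand_counts & max_ref  # clip candidate counts
                self.matched[n - 1] += sum(clipped.values())
                self.total[n - 1] += sum(cand_counts.values())

    def compute(self):
        # Part (2): brevity penalty, then geometric mean of the precisions
        if min(self.matched) == 0:
            return 0.0
        log_precisions = sum(math.log(m / t) for m, t in zip(self.matched, self.total))
        bp = 1.0 if self.cand_len > self.ref_len else math.exp(1 - self.ref_len / self.cand_len)
        return bp * math.exp(log_precisions / self.max_order)

def macro_bleu(predictions, references, max_order=4):
    # (1) + (2) per sentence, then (3) arithmetic mean over the sentences
    scores = []
    for cand, refs in zip(predictions, references):
        m = MicroBleu(max_order)
        m.update([cand], [refs])  # a one-sentence "corpus"
        scores.append(m.compute())
    return sum(scores) / len(scores)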
For DDP, this works fine for the current macro avg. Concerning the micro avg version, a synchronization is needed to sum the different counters and lengths.
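For that micro avg synchronization, a sketch with torch.distributed (assuming an initialized process group and the hypothetical MicroBleu counters from the sketch above):

import torch
import torch.distributed as dist

def sync_micro_counters(metric):
    # Sum the integer counters and lengths across all processes, so every
    # rank can then run part (2) on the global corpus statistics.
    stats = torch.tensor(
        metric.matched + metric.total + [metric.cand_len, metric.ref_len],
        dtype=torch.long,
    )
    dist.all_reduce(stats, op=dist.ReduceOp.SUM)
    n = metric.max_order
    metric.matched = stats[:n].tolist()
    metric.total = stats[n:2 * n].tolist()
    metric.cand_len = stats[2 * n].item()
    metric.ref_len = stats[2 * n + 1].item()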
HTH