Interpretation of results generated by “fairseq-generate”
`fairseq-generate` produces results in the form of a text file with information in numerical form. My question is: what do the S, D, T, H prefixes stand for? What do the two float values represent? I used the `--evaluate-bleu` flag, but I do not see my BLEU scores anywhere. Can anyone explain what they represent?
Also, I would recommend updating the docs with this information to help people who run into this issue in the future. Right now, the docs aren't great, which indirectly leads to a large number of issues, as people aren't able to work out the correct options to use.
Top GitHub Comments
The different lines you get (all examples are for a French-to-English model; a consolidated sample appears after this list):

- `S-` is the Source sentence the model has to translate. In our example it could be: `S-0 Bonjour, mon nom est Albert Einstein.`
- `T-` is the Target (or reference, or "gold") sentence you provided for this source, the one you want to compare to. In our example, it's the "official" human-translated English version: `T-0 Hello, my name is Albert Einstein.` It's possible to run `fairseq-generate` without target sentences, in which case this line won't appear.
- `H-` is the tokenized Hypothesis (or system) translation, i.e. the tokens generated by your model, along with its score. If your model works with sub-word tokens (e.g. BPE), this line will be sub-word tokens separated by spaces. Even if your model works with whole words but considers punctuation symbols as tokens, they will be space-separated. For example, you might obtain something like this with BPE and Moses tokenization (this hypothesis has a score of -0.654): `H-0 -0.654 Hi , my name is Alb@@ ert Ein@@ st@@ ein .`
- `D-` is the same as `H-` but Detokenized (after applying BPE and word tokenization in reverse). For example, you would have: `D-0 -0.654 Hi, my name is Albert Einstein.`
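Putting those together, one sample in the output file would look roughly like the sketch below (assembled from the examples above; fields are tab-separated, and the exact layout can vary slightly across fairseq versions):

```text
S-0	Bonjour, mon nom est Albert Einstein.
T-0	Hello, my name is Albert Einstein.
H-0	-0.654	Hi , my name is Alb@@ ert Ein@@ st@@ ein .
D-0	-0.654	Hi, my name is Albert Einstein.
```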
In these examples, the number after the letter (e.g. the `0` in `H-0`) is the ID of the sample, which, from what I understand, is simply its index. Translations for sentences in your test set might not be generated sequentially, so you can use this ID to reorder results.

As for `--evaluate-bleu`, it doesn't seem to be a valid argument anymore. There is, though, an `--eval-bleu` option in the translation task, but it might only be used during training to calculate a BLEU score on the validation set (see the sketch below).
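A minimal sketch of that training-time usage, assuming the standard translation task; `data-bin/my-dataset` is a hypothetical preprocessed dataset, and the flag values are illustrative rather than a recommendation:

```bash
fairseq-train data-bin/my-dataset \
    --task translation --arch transformer \
    --eval-bleu \
    --eval-bleu-detok moses \
    --eval-bleu-remove-bpe \
    --best-checkpoint-metric bleu \
    --maximize-best-checkpoint-metric
# --eval-bleu computes BLEU on the validation set during validation;
# --eval-bleu-detok / --eval-bleu-remove-bpe undo tokenization before scoring;
# the last two flags keep the checkpoint with the best validation BLEU.
```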
From your `fairseq-generate` output file, you could calculate a BLEU score like so. First, keep only the target sentences (`T-` lines) and the generated translations (`D-` lines) from your output file (e.g. `gen.out`) and separate them into two files:
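For example, with standard Unix tools (this assumes the default tab-separated layout shown earlier, where `D-` lines carry an extra score column):

```bash
# References: drop the "T-<id>" column, keep only the sentence
grep ^T gen.out | cut -f2- > gen.out.ref
# Hypotheses: drop the "D-<id>" and score columns, keep only the sentence
grep ^D gen.out | cut -f3- > gen.out.sys
```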
Then run `fairseq-score` on those two files (see its arguments in the fairseq docs). You would get something like the output below.
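A sketch of the invocation and the kind of output it prints; the overall `BLEU4` value here is illustrative (roughly the geometric mean of the four n-gram precisions quoted in the follow-up question below, assuming a brevity penalty of 1.0):

```bash
fairseq-score --sys gen.out.sys --ref gen.out.ref
# BLEU4 = 40.83, 67.5/46.9/34.4/25.5 (BP=1.000, ratio=..., syslen=..., reflen=...)
```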
Hi, @xfrenette! Thanks for your detailed explanation! It really resolved most of my issues. But I still have 2 questions as follows:
1. At the end of your explanation, how do we interpret the last 4 numbers (67.5/46.9/34.4/25.5)? I guess they represent BLEU2/3/5/6 instead of BLEU4, since they are decreasing, but I'm not very sure about it.
2. There is one more line at the end, starting with `P-` (probably because of updates to `fairseq-generate`). Do the values stand for the $\log_2$ probability from beam search for each sub-word?

Any responses are highly appreciated. Thanks!
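For reference, a `P-` line looks something like the sketch below: a tab after the ID, then one space-separated score per generated token (the values here are made up, and there may be one more score than visible tokens if the end-of-sentence token is scored as well):

```text
P-0	-0.4031 -0.1201 -0.2547 -0.0890 -0.6305 -0.3112 -0.2001 -0.1556 -0.0799 -1.0214 -0.2330 -0.0021
```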