Interpretation of results generated by “fairseq-generate”
`fairseq-generate` produces results in the form of a text file with information in numerical form. My question is: what do the S, D, T, H prefixes stand for? What do the two float values represent? I used the `--evaluate-bleu` flag, but I do not see my BLEU scores anywhere. Can anyone explain what they represent?
Also, I would recommend updating the docs with this information to help people who run into this issue in the future. Right now, the docs aren't great, which indirectly leads to a large number of issues, as people aren't able to work out the correct options to use.
Top GitHub Comments
The different lines you get (all examples are for a French-to-English model; a consolidated sample appears after this list):

- `S-` is the Source sentence the model has to translate. In our example it could be: `S-0 Bonjour, mon nom est Albert Einstein.`
- `T-` is the Target (or reference, or "gold") sentence you provided for this source, the one you want to compare to. In our example, it's the "official" human-translated English version: `T-0 Hello, my name is Albert Einstein.` It's possible to run `fairseq-generate` without target sentences, in which case this line won't appear.
- `H-` is the tokenized Hypothesis (or system) translation, i.e. the tokens generated by your model, along with its score. If your model works with sub-word tokens (e.g. BPE), this line will be sub-word tokens separated by spaces. Even if your model works with whole words but considers punctuation symbols as tokens, they will be space-separated. For example, you might obtain something like this with BPE and Moses tokenization (this hypothesis has a score of -0.654): `H-0 -0.654 Hi , my name is Alb@@ ert Ein@@ st@@ ein .`
- `D-` is the same as `H-` but Detokenized (after applying BPE and word tokenization in reverse). For example, you would have: `D-0 -0.654 Hi, my name is Albert Einstein.`
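Putting those together, one sample in the output file would look roughly like the sketch below (assembled from the examples above; fields are tab-separated, and the exact layout can vary slightly across fairseq versions):

```text
S-0	Bonjour, mon nom est Albert Einstein.
T-0	Hello, my name is Albert Einstein.
H-0	-0.654	Hi , my name is Alb@@ ert Ein@@ st@@ ein .
D-0	-0.654	Hi, my name is Albert Einstein.
```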
In these examples, the number after the letter (e.g. the `0` in `H-0`) is the ID of the sample, which, from what I understand, is simply its index. Translations for sentences in your test set might not be generated sequentially, so you can use this ID to reorder results.

As for `--evaluate-bleu`, it doesn't seem to be a valid argument anymore. There is, though, an `--eval-bleu` option in the translation task, but it might only be used during training to calculate a BLEU score on the validation set (see the sketch below).
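A minimal sketch of that training-time usage, assuming the standard translation task; `data-bin/my-dataset` is a hypothetical preprocessed dataset, and the flag values are illustrative rather than a recommendation:

```bash
fairseq-train data-bin/my-dataset \
    --task translation --arch transformer \
    --eval-bleu \
    --eval-bleu-detok moses \
    --eval-bleu-remove-bpe \
    --best-checkpoint-metric bleu \
    --maximize-best-checkpoint-metric
# --eval-bleu computes BLEU on the validation set during validation;
# --eval-bleu-detok / --eval-bleu-remove-bpe undo tokenization before scoring;
# the last two flags keep the checkpoint with the best validation BLEU.
```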
From your `fairseq-generate` output file, you could calculate a BLEU score like so. First, keep only the target sentences (`T-` lines) and the generated translations (`D-` lines) from your output file (e.g. `gen.out`) and separate them into two files:
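For example, with standard Unix tools (this assumes the default tab-separated layout shown earlier, where `D-` lines carry an extra score column):

```bash
# References: drop the "T-<id>" column, keep only the sentence
grep ^T gen.out | cut -f2- > gen.out.ref
# Hypotheses: drop the "D-<id>" and score columns, keep only the sentence
grep ^D gen.out | cut -f3- > gen.out.sys
```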
Then run `fairseq-score` on those two files (see its arguments in the fairseq docs). You would get something like the output below.
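A sketch of the invocation and the kind of output it prints; the overall `BLEU4` value here is illustrative (roughly the geometric mean of the four n-gram precisions quoted in the follow-up question below, assuming a brevity penalty of 1.0):

```bash
fairseq-score --sys gen.out.sys --ref gen.out.ref
# BLEU4 = 40.83, 67.5/46.9/34.4/25.5 (BP=1.000, ratio=..., syslen=..., reflen=...)
```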
Hi, @xfrenette! Thanks for your detailed explanation! It really resolved most of my issues. But I still have 2 questions as follows:
1. At the end of your explanation, how do we interpret the last 4 numbers (67.5/46.9/34.4/25.5)? I guess they represent BLEU2/3/5/6 instead of BLEU4, since they are decreasing, but I'm not very sure about it.
2. There is one more line at the end, starting with `P-` (probably because of updates to `fairseq-generate`). Do the values stand for the $\log_2$ probability from beam search for each sub-word?

Any responses are highly appreciated. Thanks!
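For reference, a `P-` line looks something like the sketch below: a tab after the ID, then one space-separated score per generated token (the values here are made up, and there may be one more score than visible tokens if the end-of-sentence token is scored as well):

```text
P-0	-0.4031 -0.1201 -0.2547 -0.0890 -0.6305 -0.3112 -0.2001 -0.1556 -0.0799 -1.0214 -0.2330 -0.0021
```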