Reproducing deepset/roberta-base-squad2 metrics on HuggingFace
Describe the bug
This is not really a bug, but an investigation into why the metrics reported for deepset/roberta-base-squad2 on HuggingFace are not reproducible using the `FARMReader.eval_on_file` function in Haystack version 1.6.0.
Metrics on HuggingFace

```json
{
    "exact": 79.87029394424324,
    "f1": 82.91251169582613,
    "total": 11873,
    "HasAns_exact": 77.93522267206478,
    "HasAns_f1": 84.02838248389763,
    "HasAns_total": 5928,
    "NoAns_exact": 81.79983179142137,
    "NoAns_f1": 81.79983179142137,
    "NoAns_total": 5945
}
```
Expected behavior

The following code should produce metrics similar to the ones reported on HuggingFace:
```python
from haystack.nodes import FARMReader

reader = FARMReader(
    model_name_or_path='deepset/roberta-base-squad2', use_gpu=True, max_seq_len=386
)
metrics = reader.eval_on_file(
    data_dir='./data/',
    test_filename='dev-v2.0.json',
    device='cuda',
)
```
Data for eval is available here (official SQuAD2 dev set): https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
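To fetch the file programmatically, a small sketch using only the standard library (assuming the `./data/` directory layout used above):

```python
# Download the official SQuAD2 dev set into ./data/ for the eval call above.
import pathlib
import urllib.request

pathlib.Path("./data").mkdir(exist_ok=True)
urllib.request.urlretrieve(
    "https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json",
    "./data/dev-v2.0.json",
)
```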
Currently, Haystack v1.6.0 produces

```json
"v1.6.0": {
    "EM": 76.13913922344815,
    "f1": 78.1999299751871,
    "top_n_accuracy": 97.45641371178304,
    "top_n": 4,
    "EM_text_answer": 61.52159244264508,
    "f1_text_answer": 65.64908377115323,
    "top_n_accuracy_text_answer": 94.90553306342781,
    "top_n_EM_text_answer": 81.34278002699055,
    "top_n_f1_text_answer": 90.69119695344651,
    "Total_text_answer": 5928,
    "EM_no_answer": 90.71488645920942,
    "f1_no_answer": 90.71488645920942,
    "top_n_accuracy_no_answer": 100.0,
    "Total_no_answer": 5945
}
```
We can immediately see that the EM (79.9 vs. 76.1) and F1 (82.9 vs. 78.2) scores are lower than expected.
If we roll back to Haystack v1.3.0, we get the following metrics

```json
"v1.3.0": {
    "EM": 0.7841320643476796,
    "f1": 0.826529420882385,
    "top_n_accuracy": 0.9741430135601785
}
```
which, noting that v1.3.0 reports fractions rather than percentages (i.e. EM 78.4 and F1 82.7), match the ones reported on HuggingFace much more closely.
Identifying the change in Haystack
I believe the code change that caused this difference was introduced by the fix for https://github.com/deepset-ai/haystack/issues/2410. To check this, I updated `FARMReader.eval_on_file` to accept the argument `use_no_answer_legacy_confidence` (which was originally added in https://github.com/deepset-ai/haystack/pull/2414/files for `FARMReader.eval` but not for `FARMReader.eval_on_file`) in the branch https://github.com/deepset-ai/haystack/tree/issue_farmreader_eval and ran the following code:
```python
results = reader.eval_on_file(
    data_dir='./data/',
    test_filename='dev-v2.0.json',
    device='cuda',
    use_no_answer_legacy_confidence=True,
)
```
which produced

```python
{'EM': 78.37951655015581,
 'f1': 82.61101723722291,
 'top_n_accuracy': 97.41430135601786,
 'top_n': 4,
 'EM_text_answer': 75.0,
 'f1_text_answer': 83.47513624452559,
 'top_n_accuracy_text_answer': 94.82118758434548,
 'top_n_EM_text_answer': 81.00539811066126,
 'top_n_f1_text_answer': 90.50564293271806,
 'Total_text_answer': 5928,
 'EM_no_answer': 81.74936921783012,
 'f1_no_answer': 81.74936921783012,
 'top_n_accuracy_no_answer': 100.0,
 'Total_no_answer': 5945}
```
The EM and F1 scores now show the expected values.
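To get an intuition for why this flag shifts the no-answer/text-answer balance so much, here is a deliberately simplified, hypothetical sketch; the function names and scoring formulas are illustrative only, not Haystack's actual implementation. The idea: if the no-answer confidence is derived from the best text answer's confidence, the answer/no-answer decision can flip compared to giving no-answer its own calibrated score.

```python
# Hypothetical illustration only -- not Haystack's actual code. It shows how
# two ways of assigning a no-answer confidence can flip the final decision
# for the exact same model logits.
import math

def sigmoid(x: float) -> float:
    return 1 / (1 + math.exp(-x))

def decide(best_span_logit: float, no_answer_logit: float, legacy: bool) -> str:
    span_conf = sigmoid(best_span_logit)
    if legacy:
        # "Legacy"-style: no-answer confidence mirrors the best span's confidence.
        no_ans_conf = 1 - span_conf
    else:
        # Newer style: no-answer gets its own calibrated confidence.
        no_ans_conf = sigmoid(no_answer_logit)
    return "no_answer" if no_ans_conf > span_conf else "text_answer"

# Same logits, different decision depending on the confidence scheme:
print(decide(best_span_logit=0.5, no_answer_logit=1.5, legacy=True))   # text_answer
print(decide(best_span_logit=0.5, no_answer_logit=1.5, legacy=False))  # no_answer
```

This matches the direction of the numbers above: the newer scheme pushes more questions toward no-answer (EM_no_answer 90.7 vs. 81.7) at the cost of the text-answer subset (EM_text_answer 61.5 vs. 75.0).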
Solution

I'm not entirely sure how this should be resolved, but it seems the `no_answer` logic used in eval should probably be linked to the models that were trained with that logic.

- [ ] ~~Consider saving the value of `use_no_answer_legacy_confidence` with the model metadata, so that when the model is reloaded it uses the same `no_answer` confidence logic it was trained with.~~ This would not solve the issue, as discussed below.
FAQ Check
- Have you had a look at our new FAQ page?
System:
- OS: Ubuntu
- GPU/CPU: GPU
- Haystack version (commit or version number): v1.6.0
Top GitHub Comments
After some discussion with @Timoeller, @ju-gu, @mathislucka and @tstadel we agreed that we would like the `FARMReader.eval` and `FARMReader.eval_on_file` functions to behave the same way as the `FARMReader.predict` function. So this means we will remove the `use_no_answer_legacy_confidence` and `use_confidence_scores_for_ranking` options from the `eval` and `eval_on_file` functions and instead use the class variables of `FARMReader` to determine these values.

Just a quick status from my side. I'm still investigating: my goal is to have a clear view on what this means for:
If we have that, let's focus on what needs to be fixed or should be parameterized in any way.
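For illustration, a hypothetical sketch of the agreed-upon direction; the names and signatures here are illustrative only, not the final Haystack API:

```python
# Hypothetical sketch (not the final Haystack API): evaluation reads the
# reader's own configuration instead of accepting per-call override flags,
# so eval/eval_on_file and predict share the same no-answer behavior.
class FARMReader:
    def __init__(self, model_name_or_path: str, use_confidence_scores: bool = True):
        self.model_name_or_path = model_name_or_path
        # Set once at construction; used by predict() *and* eval*().
        self.use_confidence_scores = use_confidence_scores

    def eval_on_file(self, data_dir: str, test_filename: str, device: str = None):
        # No use_no_answer_legacy_confidence / use_confidence_scores_for_ranking
        # parameters here -- behavior follows the instance configuration above.
        ...
```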