
Reproducing deepset/roberta-base-squad2 metrics on HuggingFace


Describe the bug
This is not really a bug, but rather an investigation into why the metrics reported for deepset/roberta-base-squad2 on HuggingFace are not reproducible with the FARMReader.eval_on_file function in Haystack version 1.6.0.

Metrics on HuggingFace

    {
      "exact": 79.87029394424324,
      "f1": 82.91251169582613,
      "total": 11873,
      "HasAns_exact": 77.93522267206478,
      "HasAns_f1": 84.02838248389763,
      "HasAns_total": 5928,
      "NoAns_exact": 81.79983179142137,
      "NoAns_f1": 81.79983179142137,
      "NoAns_total": 5945
    }

Expected behavior
The following code should produce metrics similar to the ones reported on HuggingFace:

from haystack.nodes import FARMReader

# Load deepset/roberta-base-squad2 into a FARMReader
reader = FARMReader(
    model_name_or_path='deepset/roberta-base-squad2', use_gpu=True, max_seq_len=386
)
# Evaluate against the SQuAD2 dev set stored in ./data/dev-v2.0.json
metrics = reader.eval_on_file(
    data_dir='./data/',
    test_filename='dev-v2.0.json',
    device='cuda',
)

Data for eval is available here (official SQuAD2 dev set): https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
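If the file is not already in ./data/, a minimal way to fetch it (matching the paths used in the snippet above) is:

# Download the official SQuAD2 dev set into ./data/ so the paths above resolve
import os
import urllib.request

os.makedirs("./data", exist_ok=True)
urllib.request.urlretrieve(
    "https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json",
    "./data/dev-v2.0.json",
)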

Currently, Haystack v1.6.0 produces

    "v1.6.0": {
      "EM": 76.13913922344815,
      "f1": 78.1999299751871,
      "top_n_accuracy": 97.45641371178304,
      "top_n": 4,
      "EM_text_answer": 61.52159244264508,
      "f1_text_answer": 65.64908377115323,
      "top_n_accuracy_text_answer": 94.90553306342781,
      "top_n_EM_text_answer": 81.34278002699055,
      "top_n_f1_text_answer": 90.69119695344651,
      "Total_text_answer": 5928,
      "EM_no_answer": 90.71488645920942,
      "f1_no_answer": 90.71488645920942,
      "top_n_accuracy_no_answer": 100.0,
      "Total_no_answer": 5945
    },

and we can immediately see that the EM (79.9 vs. 76.1) and F1 (82.9 vs. 78.2) scores are lower than expected.

If we roll back to Haystack v1.3.0 we get the following metrics

    "v1.3.0": {
      "EM": 0.7841320643476796,
      "f1": 0.826529420882385,
      "top_n_accuracy": 0.9741430135601785
    }

which, once scaled to percentages, match the numbers reported on HuggingFace much more closely.
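For a like-for-like comparison, the v1.3.0 fractions can be scaled by 100:

# Scale the v1.3.0 metrics (fractions) to percentages for a direct comparison
v1_3_0 = {
    "EM": 0.7841320643476796,
    "f1": 0.826529420882385,
    "top_n_accuracy": 0.9741430135601785,
}
print({k: round(v * 100, 2) for k, v in v1_3_0.items()})
# {'EM': 78.41, 'f1': 82.65, 'top_n_accuracy': 97.41} vs. HuggingFace's 79.87 / 82.91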

Identifying the change in Haystack
I believe the difference in results was introduced by the fix for issue https://github.com/deepset-ai/haystack/issues/2410. To check this, I updated FARMReader.eval_on_file to accept the argument use_no_answer_legacy_confidence (originally added for FARMReader.eval, but not FARMReader.eval_on_file, in https://github.com/deepset-ai/haystack/pull/2414/files) in the branch https://github.com/deepset-ai/haystack/tree/issue_farmreader_eval and ran the following code:

results = reader.eval_on_file(
    data_dir='./data/',
    test_filename='dev-v2.0.json',
    device='cuda',
    use_no_answer_legacy_confidence=True
)

which produced

{'EM': 78.37951655015581,
 'f1': 82.61101723722291,
 'top_n_accuracy': 97.41430135601786,
 'top_n': 4,
 'EM_text_answer': 75.0,
 'f1_text_answer': 83.47513624452559,
 'top_n_accuracy_text_answer': 94.82118758434548,
 'top_n_EM_text_answer': 81.00539811066126,
 'top_n_f1_text_answer': 90.50564293271806,
 'Total_text_answer': 5928,
 'EM_no_answer': 81.74936921783012,
 'f1_no_answer': 81.74936921783012,
 'top_n_accuracy_no_answer': 100.0,
 'Total_no_answer': 5945}

The EM and F1 scores now show the expected values.
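To make the effect of the flag more concrete, below is a deliberately simplified, hypothetical sketch of two no-answer decision schemes. It is not Haystack's actual implementation; it only illustrates why evaluating a model under a different no-answer confidence scheme than the one it was tuned for can shift EM and F1.

# Hypothetical illustration only, NOT Haystack's code: a "no answer" decision can
# be based on raw scores (legacy-style) or on calibrated confidences, and a model
# tuned under one scheme can lose EM/F1 when evaluated under the other.
def predicts_no_answer(
    no_answer_score: float,
    best_span_score: float,
    no_answer_confidence: float,
    best_span_confidence: float,
    use_legacy: bool,
) -> bool:
    if use_legacy:
        # Legacy-style: compare raw model scores directly
        return no_answer_score > best_span_score
    # Newer-style: compare calibrated confidence values instead
    return no_answer_confidence > best_span_confidence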

Solution
I'm not entirely sure how this should be resolved, but it seems the no_answer logic used in eval should probably be tied to the logic the model was trained with.

  • [ ] Consider saving the value of use_no_answer_legacy_confidence with model meta data so when the model is reloaded it uses the same no_answer confidence logic as it was trained with. This would not solve the issue as discussed below.

FAQ Check

System:

  • OS: ubuntu
  • GPU/CPU: GPU
  • Haystack version (commit or version number): v1.6.0

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 16 (16 by maintainers)

Top GitHub Comments

1 reaction
sjrl commented, Aug 9, 2022

After some discussion with @Timoeller, @ju-gu, @mathislucka and @tstadel we agreed that we would like the FARMReader.eval and FARMReader.eval_on_file functions to behave the same way as the FARMReader.predict functions.

So this means we will remove the use_no_answer_legacy_confidence and use_confidence_scores_for_ranking options from the eval and eval_on_file functions and instead use the class variables of FARMReader to determine these values.
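Roughly, that direction could look like the following hypothetical sketch (made-up class and attribute names, not the actual Haystack code): the per-call flags disappear and eval reads the reader's own settings, so eval and predict stay consistent.

# Hypothetical sketch, not actual Haystack code: eval_on_file drops the
# confidence-related flags and reuses the settings the reader was created with.
class FARMReaderSketch:
    def __init__(self, use_confidence_scores: bool = True,
                 use_no_answer_legacy_confidence: bool = False):
        self.use_confidence_scores = use_confidence_scores
        self.use_no_answer_legacy_confidence = use_no_answer_legacy_confidence

    def eval_on_file(self, data_dir: str, test_filename: str, device: str = "cuda"):
        # Pull the confidence settings from the instance instead of arguments,
        # mirroring how predict() behaves.
        return {
            "use_confidence_scores_for_ranking": self.use_confidence_scores,
            "use_no_answer_legacy_confidence": self.use_no_answer_legacy_confidence,
        }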

1 reaction
tstadel commented, Jul 21, 2022

Just a quick status from my side. I'm still investigating; my goal is to have a clear view of what this means for:

  • Training
  • Eval
  • Inference in Haystack

Once we have that, let's focus on what needs to be fixed or parameterized.


