Reproducing deepset/roberta-base-squad2 metrics on HuggingFace
Describe the bug
This is not really a bug, but an investigation into why the metrics reported for deepset/roberta-base-squad2 on HuggingFace are not reproducible using the `FARMReader.eval_on_file` function in Haystack version 1.6.0.
Metrics on HuggingFace

```json
{
    "exact": 79.87029394424324,
    "f1": 82.91251169582613,
    "total": 11873,
    "HasAns_exact": 77.93522267206478,
    "HasAns_f1": 84.02838248389763,
    "HasAns_total": 5928,
    "NoAns_exact": 81.79983179142137,
    "NoAns_f1": 81.79983179142137,
    "NoAns_total": 5945
}
```
Expected behavior

The following code should produce metrics similar to the ones reported on HuggingFace:
```python
from haystack.nodes import FARMReader

reader = FARMReader(
    model_name_or_path='deepset/roberta-base-squad2', use_gpu=True, max_seq_len=386
)
metrics = reader.eval_on_file(
    data_dir='./data/',
    test_filename='dev-v2.0.json',
    device='cuda',
)
```
Data for eval is available here (official SQuAD2 dev set): https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
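To fetch the file programmatically, a small sketch using only the standard library (assuming the `./data/` directory layout used above):

```python
# Download the official SQuAD2 dev set into ./data/ for the eval call above.
import pathlib
import urllib.request

pathlib.Path("./data").mkdir(exist_ok=True)
urllib.request.urlretrieve(
    "https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json",
    "./data/dev-v2.0.json",
)
```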
Currently, Haystack v1.6.0 produces

```json
"v1.6.0": {
    "EM": 76.13913922344815,
    "f1": 78.1999299751871,
    "top_n_accuracy": 97.45641371178304,
    "top_n": 4,
    "EM_text_answer": 61.52159244264508,
    "f1_text_answer": 65.64908377115323,
    "top_n_accuracy_text_answer": 94.90553306342781,
    "top_n_EM_text_answer": 81.34278002699055,
    "top_n_f1_text_answer": 90.69119695344651,
    "Total_text_answer": 5928,
    "EM_no_answer": 90.71488645920942,
    "f1_no_answer": 90.71488645920942,
    "top_n_accuracy_no_answer": 100.0,
    "Total_no_answer": 5945
}
```
We can immediately see that the EM (79.9 vs. 76.1) and F1 (82.9 vs. 78.2) scores are lower than expected.
If we roll back to Haystack v1.3.0, we get the following metrics

```json
"v1.3.0": {
    "EM": 0.7841320643476796,
    "f1": 0.826529420882385,
    "top_n_accuracy": 0.9741430135601785
}
```
which, noting that v1.3.0 reports fractions rather than percentages (i.e. EM 78.4 and F1 82.7), match the ones reported on HuggingFace much more closely.
Identifying the change in Haystack
I believe the code change that caused this difference was introduced by the fix for https://github.com/deepset-ai/haystack/issues/2410. To check this, I updated `FARMReader.eval_on_file` to accept the argument `use_no_answer_legacy_confidence` (which was originally added in https://github.com/deepset-ai/haystack/pull/2414/files for `FARMReader.eval` but not for `FARMReader.eval_on_file`) in the branch https://github.com/deepset-ai/haystack/tree/issue_farmreader_eval and ran the following code:
```python
results = reader.eval_on_file(
    data_dir='./data/',
    test_filename='dev-v2.0.json',
    device='cuda',
    use_no_answer_legacy_confidence=True,
)
```
which produced

```python
{'EM': 78.37951655015581,
 'f1': 82.61101723722291,
 'top_n_accuracy': 97.41430135601786,
 'top_n': 4,
 'EM_text_answer': 75.0,
 'f1_text_answer': 83.47513624452559,
 'top_n_accuracy_text_answer': 94.82118758434548,
 'top_n_EM_text_answer': 81.00539811066126,
 'top_n_f1_text_answer': 90.50564293271806,
 'Total_text_answer': 5928,
 'EM_no_answer': 81.74936921783012,
 'f1_no_answer': 81.74936921783012,
 'top_n_accuracy_no_answer': 100.0,
 'Total_no_answer': 5945}
```
The EM and F1 scores now show the expected values.
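To get an intuition for why this flag shifts the no-answer/text-answer balance so much, here is a deliberately simplified, hypothetical sketch; the function names and scoring formulas are illustrative only, not Haystack's actual implementation. The idea: if the no-answer confidence is derived from the best text answer's confidence, the answer/no-answer decision can flip compared to giving no-answer its own calibrated score.

```python
# Hypothetical illustration only -- not Haystack's actual code. It shows how
# two ways of assigning a no-answer confidence can flip the final decision
# for the exact same model logits.
import math

def sigmoid(x: float) -> float:
    return 1 / (1 + math.exp(-x))

def decide(best_span_logit: float, no_answer_logit: float, legacy: bool) -> str:
    span_conf = sigmoid(best_span_logit)
    if legacy:
        # "Legacy"-style: no-answer confidence mirrors the best span's confidence.
        no_ans_conf = 1 - span_conf
    else:
        # Newer style: no-answer gets its own calibrated confidence.
        no_ans_conf = sigmoid(no_answer_logit)
    return "no_answer" if no_ans_conf > span_conf else "text_answer"

# Same logits, different decision depending on the confidence scheme:
print(decide(best_span_logit=0.5, no_answer_logit=1.5, legacy=True))   # text_answer
print(decide(best_span_logit=0.5, no_answer_logit=1.5, legacy=False))  # no_answer
```

This matches the direction of the numbers above: the newer scheme pushes more questions toward no-answer (EM_no_answer 90.7 vs. 81.7) at the cost of the text-answer subset (EM_text_answer 61.5 vs. 75.0).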
Solution

I'm not entirely sure how this should be resolved, but it seems the `no_answer` logic used in eval should probably be linked to the models that were trained with that logic.

- [ ] ~~Consider saving the value of `use_no_answer_legacy_confidence` with the model metadata, so that when the model is reloaded it uses the same `no_answer` confidence logic it was trained with.~~ This would not solve the issue, as discussed below.
FAQ Check
- Have you had a look at our new FAQ page?
System:
- OS: Ubuntu
- GPU/CPU: GPU
- Haystack version (commit or version number): v1.6.0
Top GitHub Comments
After some discussion with @Timoeller, @ju-gu, @mathislucka and @tstadel we agreed that we would like the `FARMReader.eval` and `FARMReader.eval_on_file` functions to behave the same way as the `FARMReader.predict` function. So this means we will remove the `use_no_answer_legacy_confidence` and `use_confidence_scores_for_ranking` options from the `eval` and `eval_on_file` functions and instead use the class variables of `FARMReader` to determine these values.

Just a quick status from my side. I'm still investigating: my goal is to have a clear view on what this means for:
If we have that, let's focus on what needs to be fixed or should be parameterized in any way.
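For illustration, a hypothetical sketch of the agreed-upon direction; the names and signatures here are illustrative only, not the final Haystack API:

```python
# Hypothetical sketch (not the final Haystack API): evaluation reads the
# reader's own configuration instead of accepting per-call override flags,
# so eval/eval_on_file and predict share the same no-answer behavior.
class FARMReader:
    def __init__(self, model_name_or_path: str, use_confidence_scores: bool = True):
        self.model_name_or_path = model_name_or_path
        # Set once at construction; used by predict() *and* eval*().
        self.use_confidence_scores = use_confidence_scores

    def eval_on_file(self, data_dir: str, test_filename: str, device: str = None):
        # No use_no_answer_legacy_confidence / use_confidence_scores_for_ranking
        # parameters here -- behavior follows the instance configuration above.
        ...
```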