Example script for SQuAD question answering unable to reproduce the claimed performance
🐛 Bug
Information
The example script for SQuAD question answering (`examples/question-answering/run_squad.py`) fails to reproduce the results claimed in the tutorial.
The expected performance on SQuAD v1.1 is around f1 = 88.52, exact_match = 81.22, but the script produces f1 = 81.97 and exact_match = 73.80 instead.
To reproduce
Steps to reproduce the behavior:
- Install transformers from the latest commit (a34a989).
- Download the SQuAD v1.1 dataset.
- Run `examples/question-answering/run_squad.py` with the exact arguments given in the tutorial:
export SQUAD_DIR=/path/to/SQUAD
python run_squad.py \
--model_type bert \
--model_name_or_path bert-base-uncased \
--do_train \
--do_eval \
--train_file $SQUAD_DIR/train-v1.1.json \
--predict_file $SQUAD_DIR/dev-v1.1.json \
--per_gpu_train_batch_size 12 \
--learning_rate 3e-5 \
--num_train_epochs 2.0 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /tmp/debug_squad/
The following is the final result:
05/24/2020 16:10:09 - INFO - main - ***** Running evaluation *****
05/24/2020 16:10:09 - INFO - main -   Num examples = 10789
05/24/2020 16:10:09 - INFO - main -   Batch size = 8
Evaluating: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 1349/1349 [01:31<00:00, 14.81it/s]
05/24/2020 16:11:41 - INFO - main - Evaluation done in total 91.079697 secs (0.008442 sec per example)
05/24/2020 16:11:41 - INFO - transformers.data.metrics.squad_metrics - Writing predictions to: out-noamp/predictions_.json
05/24/2020 16:11:41 - INFO - transformers.data.metrics.squad_metrics - Writing nbest to: out-noamp/nbest_predictions_.json
05/24/2020 16:12:09 - INFO - main - Results: {'exact': 73.80321665089878, 'f1': 81.96651715123286, 'total': 10570, 'HasAns_exact': 73.80321665089878, 'HasAns_f1': 81.96651715123286, 'HasAns_total': 10570, 'best_exact': 73.80321665089878, 'best_exact_thresh': 0.0, 'best_f1': 81.96651715123286, 'best_f1_thresh': 0.0}
Expected behavior
The script should produce f1 = 88.52, exact_match = 81.22.
Environment info
- `transformers` version: 2.10.0
- Platform: Linux-4.15.0-99-generic-x86_64-with-debian-buster-sid
- Python version: 3.7.7
- PyTorch version (GPU?): 1.5.0 (True)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: True
- Using distributed or parallel set-up in script?: False
Issue Analytics
- State: closed
- Created 3 years ago
- Comments: 5 (1 by maintainers)
Top GitHub Comments
Found the bug: `--do_lower_case` was missing from the script arguments. With it added, the results are now pretty close to the ones mentioned in the tutorial.
05/24/2020 23:50:04 - INFO - main - Results: {'exact': 80.26490066225166, 'f1': 88.01726518927101, 'total': 10570, 'HasAns_exact': 80.26490066225166, 'HasAns_f1': 88.01726518927101, 'HasAns_total': 10570, 'best_exact': 80.26490066225166, 'best_exact_thresh': 0.0, 'best_f1': 88.01726518927101, 'best_f1_thresh': 0.0}
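For reference, the corrected invocation is the original command from the issue with the missing flag added (paths are placeholders, as above):

```shell
export SQUAD_DIR=/path/to/SQUAD
python run_squad.py \
  --model_type bert \
  --model_name_or_path bert-base-uncased \
  --do_train \
  --do_eval \
  --do_lower_case \
  --train_file $SQUAD_DIR/train-v1.1.json \
  --predict_file $SQUAD_DIR/dev-v1.1.json \
  --per_gpu_train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 2.0 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /tmp/debug_squad/
```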
Closing this b/c #4245 was merged
(we still need to investigate why the lowercasing is not properly populated by the model's config)
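As a side note on why the flag matters: `bert-base-uncased` ships a lowercase-only vocabulary, so text that is not lowercased before lookup degrades into `[UNK]` tokens, which hurts span extraction. A minimal toy sketch of that effect (hypothetical five-word vocab, not the real WordPiece tokenizer):

```python
# Toy illustration of why --do_lower_case matters for an uncased model.
# An uncased checkpoint's vocabulary contains only lowercase entries, so
# skipping lowercasing turns ordinary capitalized words into [UNK].
# (Hypothetical minimal vocab; real BERT uses a ~30k-entry WordPiece vocab.)
UNCASED_VOCAB = {"the", "capital", "of", "france", "is", "paris"}

def lookup(tokens, vocab, do_lower_case):
    """Map each token to itself if it is in the vocab, else to [UNK]."""
    if do_lower_case:
        tokens = [t.lower() for t in tokens]
    return [t if t in vocab else "[UNK]" for t in tokens]

words = "The capital of France is Paris".split()

print(lookup(words, UNCASED_VOCAB, do_lower_case=True))
# ['the', 'capital', 'of', 'france', 'is', 'paris']

print(lookup(words, UNCASED_VOCAB, do_lower_case=False))
# ['[UNK]', 'capital', 'of', '[UNK]', 'is', '[UNK]']
```

Every capitalized word is lost without lowercasing, which is consistent with the sizeable exact-match drop observed above.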