
Example script for SQuAD question answering unable to reproduce the claimed performance


🐛 Bug

Information

The example script for SQuAD question answering (examples/question-answering/run_squad.py) fails to reproduce the results claimed in the tutorial. The expected performance on SQuAD v1.1 is roughly f1 = 88.52 and exact_match = 81.22, but the script produces f1 = 81.97 and exact_match = 73.80 instead.

To reproduce

Steps to reproduce the behavior:

  1. Install transformers at the latest commit (a34a989).
  2. Download the SQuAD v1.1 dataset (a download sketch follows the command below).
  3. Run examples/question-answering/run_squad.py with the exact arguments given in the tutorial:
export SQUAD_DIR=/path/to/SQUAD

python run_squad.py \
  --model_type bert \
  --model_name_or_path bert-base-uncased \
  --do_train \
  --do_eval \
  --train_file $SQUAD_DIR/train-v1.1.json \
  --predict_file $SQUAD_DIR/dev-v1.1.json \
  --per_gpu_train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 2.0 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /tmp/debug_squad/
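
For step 2, the dataset files can be fetched directly; a minimal sketch, assuming the train and dev JSON files are still hosted at the usual SQuAD explorer URLs:

mkdir -p $SQUAD_DIR
wget -P $SQUAD_DIR https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json
wget -P $SQUAD_DIR https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json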

The final evaluation output is shown below.

05/24/2020 16:10:09 - INFO - __main__ - ***** Running evaluation *****
05/24/2020 16:10:09 - INFO - __main__ -   Num examples = 10789
05/24/2020 16:10:09 - INFO - __main__ -   Batch size = 8
Evaluating: 100%|███████████| 1349/1349 [01:31<00:00, 14.81it/s]
05/24/2020 16:11:41 - INFO - __main__ - Evaluation done in total 91.079697 secs (0.008442 sec per example)
05/24/2020 16:11:41 - INFO - transformers.data.metrics.squad_metrics - Writing predictions to: out-noamp/predictions_.json
05/24/2020 16:11:41 - INFO - transformers.data.metrics.squad_metrics - Writing nbest to: out-noamp/nbest_predictions_.json
05/24/2020 16:12:09 - INFO - __main__ - Results: {'exact': 73.80321665089878, 'f1': 81.96651715123286, 'total': 10570, 'HasAns_exact': 73.80321665089878, 'HasAns_f1': 81.96651715123286, 'HasAns_total': 10570, 'best_exact': 73.80321665089878, 'best_exact_thresh': 0.0, 'best_f1': 81.96651715123286, 'best_f1_thresh': 0.0}

Expected behavior

The script should produce approximately f1 = 88.52 and exact_match = 81.22, as claimed in the tutorial.

Environment info

  • transformers version: 2.10.0
  • Platform: Linux-4.15.0-99-generic-x86_64-with-debian-buster-sid
  • Python version: 3.7.7
  • PyTorch version (GPU?): 1.5.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: True
  • Using distributed or parallel set-up in script?: False

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 5 (1 by maintainers)

Top GitHub Comments

4 reactions
kaniblu commented, May 24, 2020

Found the bug. --do_lower_case was missing in the script arguments.

Now the results are pretty close to the ones mentioned in the tutorial.

05/24/2020 23:50:04 - INFO - __main__ - Results: {'exact': 80.26490066225166, 'f1': 88.01726518927101, 'total': 10570, 'HasAns_exact': 80.26490066225166, 'HasAns_f1': 88.01726518927101, 'HasAns_total': 10570, 'best_exact': 80.26490066225166, 'best_exact_thresh': 0.0, 'best_f1': 88.01726518927101, 'best_f1_thresh': 0.0}
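
For reference, the corrected invocation is the same command as above with the missing flag added (a sketch; all other arguments unchanged):

python run_squad.py \
  --model_type bert \
  --model_name_or_path bert-base-uncased \
  --do_train \
  --do_eval \
  --do_lower_case \
  --train_file $SQUAD_DIR/train-v1.1.json \
  --predict_file $SQUAD_DIR/dev-v1.1.json \
  --per_gpu_train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 2.0 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /tmp/debug_squad/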

0 reactions
julien-c commented, May 27, 2020

Closing this b/c #4245 was merged

(we still need to investigate why the lowercasing is not properly populated by the model’s config)
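
One quick way to probe the lowercasing behavior mentioned above is to load the tokenizer for the uncased checkpoint and inspect its lowercase setting; a minimal sketch (the basic_tokenizer.do_lower_case attribute path is an assumption for this transformers version):

# Check whether the uncased checkpoint's tokenizer is configured to lowercase input.
python -c "from transformers import BertTokenizer; \
tok = BertTokenizer.from_pretrained('bert-base-uncased'); \
print(tok.basic_tokenizer.do_lower_case)"  # expected: True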


Top Results From Across the Web

  • Question answering - Hugging Face
    Abstractive: generate an answer from the context that correctly answers the question. This guide will show you how to: Finetune DistilBERT on the...
  • Adapting and evaluating a deep learning language model for ...
    To adapt and evaluate a deep learning language model for answering why-questions based on patient-specific clinical text. Materials and Methods. Bidirectional ...
  • CS 224N Default Final Project: Question Answering on ...
    This year's project will use the updated version of SQuAD, named SQuAD 2.0 [2], which extends the original dataset with unanswerable questions.
  • What Makes Reading Comprehension Questions Easier? | DeepAI
    Figure 1: Example from the SQuAD dataset Rajpurkar et al. (2016). The baseline system can answer the token-limited question and, even if...
