
XQUAD results reproducibility for mBERT

See original GitHub issue

Hi, thanks for the benchmark and the accompanying code!

I am trying to replicate the XQuAD scores from the XTREME paper using this repo’s code. I ran the mBERT cased model with default parameters and strictly followed the instructions in the README file.

However, the results for some languages are much lower than the scores in the paper. In particular, for vi and th the gap is two-fold, and there are also significant drops for hi and el. The results for e.g. en, es, and de, on the other hand, are comparable.

Below is a table with the scores I just obtained from running the code, alongside the corresponding numbers from the paper. @sebastianruder, could I please ask you to take a look at it?

XQuAD

| lang | EM (this run) | F1 (this run) | paper (F1 / EM) |
|------|---------------|---------------|-----------------|
| en   | 71.76         | 83.86         | 83.5 / 72.2     |
| es   | 53.95         | 73.27         | 75.5 / 56.9     |
| de   | 52.35         | 69.47         | 70.6 / 54.0     |
| el   | 33.61         | 48.95         | 62.6 / 44.9     |
| ru   | 52.10         | 69.83         | 71.3 / 53.3     |
| tr   | 32.35         | 46.14         | 55.4 / 40.1     |
| ar   | 42.52         | 59.73         | 61.5 / 45.1     |
| vi   | 15.21         | 33.11         | 69.5 / 49.6     |
| th   | 15.29         | 24.88         | 42.7 / 33.5     |
| zh   | 48.99         | 58.65         | 58.0 / 48.3     |
| hi   | 22.44         | 38.31         | 59.2 / 46.0     |
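For context, the `exact_match` and `f1` numbers above are the standard SQuAD v1.1 metrics: normalized exact string match and token-overlap F1 between a predicted and a gold answer span. Below is a minimal sketch of that per-answer scoring in the style of the official SQuAD evaluation script, not this repo's actual evaluation code:

```python
import collections
import re
import string


def normalize(s):
    """Lowercase, drop punctuation and English articles, squeeze whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def exact_match(prediction, gold):
    """1.0 if the normalized strings match exactly, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction, gold):
    """Token-level F1 between normalized prediction and gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = collections.Counter(pred_tokens) & collections.Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

The reported scores are these values averaged over all questions (taking the best score over the gold answers for each question) and multiplied by 100.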

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments:10 (4 by maintainers)

Top GitHub Comments

2 reactions
Liangtaiwan commented, May 13, 2020

@maksym-del, @sebastianruder If you use scripts/train_qa.sh and scripts/predict_qa.sh, you need to remove the --do_lower_case argument yourself. After removing it, I get results almost identical to the numbers reported in the paper.

The flag is at line 53 and line 63:
https://github.com/google-research/xtreme/blob/5d7e46217397797f287a324c8a1d75857e592885/scripts/train_qa.sh#L50-L71
https://github.com/google-research/xtreme/blob/5d7e46217397797f287a324c8a1d75857e592885/scripts/predict_qa.sh#L59-L66
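The --do_lower_case flag plausibly matters so much here because BERT's BasicTokenizer, when lowercasing, also strips combining marks. That destroys Vietnamese diacritics and Thai vowel/tone signs, while leaving en/es/de text largely intact, which matches the pattern of drops in the table. A standard-library sketch mirroring that accent-stripping step (not the actual transformers code):

```python
import unicodedata


def lower_and_strip_accents(text):
    # Mimics what BERT's BasicTokenizer does per token when
    # do_lower_case=True: lowercase, NFD-decompose, then drop
    # all combining marks (Unicode category "Mn").
    text = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


print(lower_and_strip_accents("Việt Nam"))  # -> "viet nam", diacritics lost
print(lower_and_strip_accents("สวัสดี"))     # Thai vowel/tone marks removed too
```

Run through a cased mBERT vocabulary, such mangled strings tokenize into very different subwords than the original text, which is consistent with the two-fold drops reported for vi and th above.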

1 reaction
sebastianruder commented, Apr 16, 2020

Hi Max, thanks for your interest! For training BERT models on the QA tasks we actually used the original BERT codebase, as that was faster on Google infrastructure (see Appendix B in the paper). I’ll check that the same results can be obtained with Transformers and will get back to you.


