XQuAD results reproducibility for mBERT
Hi, thanks for the benchmark and the accompanying code!
I am trying to replicate the XQuAD scores from the XTREME paper using this repo's code. I ran the mBERT cased model with default parameters and strictly followed the instructions in the README (see the sketch below). However, the results for some languages are much lower than the scores in the paper. In particular, for `vi` and `th` the gap is roughly two-fold, and there are also significant drops for `hi` and `el`. The results for e.g. `en`, `es`, and `de`, on the other hand, are comparable.
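Roughly, the setup looks like the sketch below. This is a paraphrase rather than the exact commands (the entry point names and positional arguments are assumptions; see the README and the repo's QA scripts for the authoritative interface):

```bash
# Rough sketch of the setup (paraphrased; see the README and
# scripts/train_qa.sh / scripts/predict_qa.sh for the exact interface):
# fine-tune mBERT cased on English SQuAD v1.1 with default hyperparameters,
# then run zero-shot prediction on the XQuAD dev sets.
MODEL=bert-base-multilingual-cased
bash scripts/train_qa.sh "$MODEL"
bash scripts/predict_qa.sh "$MODEL"
```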
Below is a table with the scores I just obtained from running the code, together with the corresponding numbers from the paper. @sebastianruder, could I please ask you to take a look at it?
paper: {"f1", "exact_match"}
XQuAD
en {"exact_match": 71.76470588235294, "f1": 83.86480699632085} paper: 83.5 / 72.2
es {"exact_match": 53.94957983193277, "f1": 73.27239623706365} paper: 75.5 / 56.9
de {"exact_match": 52.35294117647059, "f1": 69.47398743963343} paper: 70.6 / 54.0
el {"exact_match": 33.61344537815126, "f1": 48.94642083187724} paper: 62.6 / 44.9
ru {"exact_match": 52.10084033613445, "f1": 69.82661430981189} paper: 71.3 / 53.3
tr {"exact_match": 32.35294117647059, "f1": 46.14441800236999} paper: 55.4 / 40.1
ar {"exact_match": 42.52100840336134, "f1": 59.72583892569921} paper: 61.5 / 45.1
vi {"exact_match": 15.210084033613445, "f1": 33.112047090752164} paper: 69.5 / 49.6
th {"exact_match": 15.294117647058824, "f1": 24.87707204093759} paper: 42.7 / 33.5
zh {"exact_match": 48.99159663865546, "f1": 58.654625486558196} paper: 58.0 / 48.3
hi {"exact_match": 22.436974789915965, "f1": 38.31058195464005} paper: 59.2 / 46.0
@maksym-del, @sebastianruder If you use scripts/train_qa.sh and scripts/predict_qa.sh, you need to remove the --do_lower_case argument yourself. After removing it, I get results that are almost the same as those reported in the paper.
See line 53 of scripts/train_qa.sh and line 63 of scripts/predict_qa.sh: https://github.com/google-research/xtreme/blob/5d7e46217397797f287a324c8a1d75857e592885/scripts/train_qa.sh#L50-L71 https://github.com/google-research/xtreme/blob/5d7e46217397797f287a324c8a1d75857e592885/scripts/predict_qa.sh#L59-L66
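Put differently, the fix is just dropping that one flag. A minimal sketch, assuming the usual transformers run_squad.py-style arguments; the exact flags and paths in the repo's scripts may differ:

```bash
# Sketch of the relevant invocation in scripts/train_qa.sh (predict_qa.sh is analogous).
# Do NOT pass --do_lower_case when fine-tuning a cased checkpoint such as
# bert-base-multilingual-cased: with the flag on, the tokenizer lowercases and
# strips accents/combining marks, which is especially damaging for el, hi, th and vi.
python third_party/run_squad.py \
  --model_type bert \
  --model_name_or_path bert-base-multilingual-cased \
  --do_train \
  --train_file "$SQUAD_DIR/train-v1.1.json" \
  --predict_file "$SQUAD_DIR/dev-v1.1.json" \
  --output_dir "$OUTPUT_DIR"
  # --do_lower_case   <-- removed: this is what caused the low vi/th/hi/el scores
```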
Hi Max, Thanks for your interest. For training BERT models on the QA tasks, we actually used the original BERT codebase as that was faster with Google infrastructure (see Appendix B in the paper). I’ll check that the same results can be obtained with Transformers and will get back to you.