XQuAD results reproducibility for mBERT
Hi, thanks for the benchmark and the accompanying code!
I am trying to replicate the XQuAD scores from the XTREME paper using this repo's code. I ran the mBERT cased model with default parameters and strictly followed the instructions in the README (see the sketch below). However, the results for some languages are much lower than the scores in the paper. In particular, for `vi` and `th` the gap is roughly two-fold, and there are also significant drops for `hi` and `el`. The results for e.g. `en`, `es`, and `de`, on the other hand, are comparable.
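Roughly, the setup looks like the sketch below. This is a paraphrase rather than the exact commands (the entry point names and positional arguments are assumptions; see the README and the repo's QA scripts for the authoritative interface):

```bash
# Rough sketch of the setup (paraphrased; see the README and
# scripts/train_qa.sh / scripts/predict_qa.sh for the exact interface):
# fine-tune mBERT cased on English SQuAD v1.1 with default hyperparameters,
# then run zero-shot prediction on the XQuAD dev sets.
MODEL=bert-base-multilingual-cased
bash scripts/train_qa.sh "$MODEL"
bash scripts/predict_qa.sh "$MODEL"
```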
Below is a table with the scores I just obtained from running the code, together with the corresponding numbers from the paper. @sebastianruder, could I please ask you to take a look at it?
paper: {"f1", "exact_match"}
XQuAD
en {"exact_match": 71.76470588235294, "f1": 83.86480699632085} paper: 83.5 / 72.2
es {"exact_match": 53.94957983193277, "f1": 73.27239623706365} paper: 75.5 / 56.9
de {"exact_match": 52.35294117647059, "f1": 69.47398743963343} paper: 70.6 / 54.0
el {"exact_match": 33.61344537815126, "f1": 48.94642083187724} paper: 62.6 / 44.9
ru {"exact_match": 52.10084033613445, "f1": 69.82661430981189} paper: 71.3 / 53.3
tr {"exact_match": 32.35294117647059, "f1": 46.14441800236999} paper: 55.4 / 40.1
ar {"exact_match": 42.52100840336134, "f1": 59.72583892569921} paper: 61.5 / 45.1
vi {"exact_match": 15.210084033613445, "f1": 33.112047090752164} paper: 69.5 / 49.6
th {"exact_match": 15.294117647058824, "f1": 24.87707204093759} paper: 42.7 / 33.5
zh {"exact_match": 48.99159663865546, "f1": 58.654625486558196} paper: 58.0 / 48.3
hi {"exact_match": 22.436974789915965, "f1": 38.31058195464005} paper: 59.2 / 46.0
@maksym-del, @sebastianruder If you use scripts/train_qa.sh and scripts/predict_qa.sh, you need to remove the --do_lower_case argument yourself. After removing it, I get results that are almost the same as those reported in the paper.
See line 53 of scripts/train_qa.sh and line 63 of scripts/predict_qa.sh: https://github.com/google-research/xtreme/blob/5d7e46217397797f287a324c8a1d75857e592885/scripts/train_qa.sh#L50-L71 https://github.com/google-research/xtreme/blob/5d7e46217397797f287a324c8a1d75857e592885/scripts/predict_qa.sh#L59-L66
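Put differently, the fix is just dropping that one flag. A minimal sketch, assuming the usual transformers run_squad.py-style arguments; the exact flags and paths in the repo's scripts may differ:

```bash
# Sketch of the relevant invocation in scripts/train_qa.sh (predict_qa.sh is analogous).
# Do NOT pass --do_lower_case when fine-tuning a cased checkpoint such as
# bert-base-multilingual-cased: with the flag on, the tokenizer lowercases and
# strips accents/combining marks, which is especially damaging for el, hi, th and vi.
python third_party/run_squad.py \
  --model_type bert \
  --model_name_or_path bert-base-multilingual-cased \
  --do_train \
  --train_file "$SQUAD_DIR/train-v1.1.json" \
  --predict_file "$SQUAD_DIR/dev-v1.1.json" \
  --output_dir "$OUTPUT_DIR"
  # --do_lower_case   <-- removed: this is what caused the low vi/th/hi/el scores
```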
Hi Max, Thanks for your interest. For training BERT models on the QA tasks, we actually used the original BERT codebase as that was faster with Google infrastructure (see Appendix B in the paper). I’ll check that the same results can be obtained with Transformers and will get back to you.