ELECTRA models show poor performance
Hi, this issue is in connection with #193 …
I trained German sentence transformers on our brand-new German ELECTRA model german-nlp-group/electra-base-german-uncased
(1) (2), but get relatively bad results compared to other German models. Yet we showed that our new ELECTRA model is better on downstream tasks (GermEval 2018 and 2017).
I train on a German XNLI dataset. dbmdz/bert-base-german-uncased
reaches 0.731166 cosine Spearman, while the ELECTRA model barely reaches 0.6 - both are optimized with Optuna over several trials.
What might be the reason? Do you have a theory? AFAIK the last layers of BERT and ELECTRA are identical, so I have no idea why the results differ so much.
(1): Due to a bug it does not show up on the HF model hub yet - see https://github.com/huggingface/transformers/issues/6495
(2): The model card is here: https://github.com/German-NLP-Group/german-transformer-training/blob/master/model_cards/electra-base-german-uncased.md
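For reference, a minimal sketch of the setup described above, assuming the standard sentence-transformers training API. The model name is the one from this issue; the example sentences, dataset loading, and hyperparameters are placeholders, not the actual training configuration:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses, InputExample
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

# Build a sentence transformer on top of the German ELECTRA checkpoint
# (mean pooling over the token embeddings, same pooling as for the BERT baseline).
word_embedding_model = models.Transformer(
    "german-nlp-group/electra-base-german-uncased", max_seq_length=128
)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Placeholder German XNLI examples: label 0/1/2 = entailment/neutral/contradiction.
train_examples = [
    InputExample(texts=["Ein Mann liest ein Buch.", "Eine Person liest."], label=0),
    # ... the real XNLI training data would be loaded here ...
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.SoftmaxLoss(
    model=model,
    sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
    num_labels=3,
)

# STS-style dev set with gold similarity scores in [0, 1]; this evaluator reports
# the cosine Spearman correlation quoted above.
dev_examples = [
    InputExample(texts=["Ein Hund rennt.", "Ein Hund läuft schnell."], label=0.8),
]
dev_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(
    dev_examples, name="german-sts-dev"
)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=dev_evaluator,
    epochs=1,
    warmup_steps=100,
)
```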
Hi @PhilipMay, I think the sampling strategy for 1) is quite important. Out of the box, with round-robin sampling, the results are rather bad.
I would like to join, but I am afraid I do not really have time: I am currently working on several other research projects.
But I am looking forward to your experiences.
Hi @PhilipMay, I haven’t evaluated which is better: 1) multi-task training on NLI+STS, or 2) training first on NLI and then on STS.
This script does 1), while I personally use 2) in my experiments. But I never evaluated which approach is better: https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/other/training_multi-task.py
Edit: As the dataset sizes of NLI and STS are extremely different, I found that 2) makes a bit more sense. Otherwise, for 1), you need to figure out how to deal with the different dataset sizes. Currently it does round-robin sampling, i.e. NLI and STS each get 50% of the batches. But this can quickly lead to overfitting on the small STS training set, while for NLI you have only seen a fraction of the samples.
Out of the box, the NLI+STS multi-task setup leads to worse scores than approach 2), where you first fine-tune on NLI and then on STS.
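To illustrate the two approaches being compared, here is a rough sketch using the sentence-transformers fit API. Passing several train_objectives is what produces the round-robin sampling discussed above, while approach 2) is simply two consecutive fit calls; the dataloaders, losses, and epoch counts are placeholder assumptions, not a recommended configuration:

```python
from sentence_transformers import losses

# Assumed to exist: `model`, plus dataloaders built as in the earlier snippet
# (nli_dataloader with 0/1/2 labels, sts_dataloader with float similarity scores).
nli_loss = losses.SoftmaxLoss(
    model=model,
    sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
    num_labels=3,
)
sts_loss = losses.CosineSimilarityLoss(model=model)

# Approach 1): multi-task training. With multiple objectives, batches are drawn
# from each dataloader in round-robin fashion, so NLI and STS each contribute
# 50% of the updates regardless of how different the dataset sizes are.
model.fit(
    train_objectives=[(nli_dataloader, nli_loss), (sts_dataloader, sts_loss)],
    epochs=1,
    warmup_steps=100,
)

# Approach 2): sequential training - first NLI, then fine-tune on STS.
model.fit(train_objectives=[(nli_dataloader, nli_loss)], epochs=1, warmup_steps=100)
model.fit(train_objectives=[(sts_dataloader, sts_loss)], epochs=4, warmup_steps=100)
```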