ELECTRA models show poor performance
Hi, this issue is in connection with #193 …
I trained German sentence transformers on our brand-new German ELECTRA model german-nlp-group/electra-base-german-uncased
(1) (2), but get relatively bad results compared to other German models. Yet we showed that our new ELECTRA model is better on downstream tasks (GermEval 2018 and 2017).
I train on a German XNLI dataset. dbmdz/bert-base-german-uncased
reaches 0.731166 cosine Spearman, while the ELECTRA model barely reaches 0.6 - both are optimized with Optuna over several trials.
What might be the reason? Do you have a theory? AFAIK the last layers of BERT and ELECTRA are identical, so I have no idea why the results differ so much.
(1): Due to a bug it does not show up on the HF model hub yet - see https://github.com/huggingface/transformers/issues/6495
(2): The model card is here: https://github.com/German-NLP-Group/german-transformer-training/blob/master/model_cards/electra-base-german-uncased.md
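For reference, a minimal sketch of the setup described above, assuming the standard sentence-transformers training API. The model name is the one from this issue; the example sentences, dataset loading, and hyperparameters are placeholders, not the actual training configuration:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses, InputExample
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

# Build a sentence transformer on top of the German ELECTRA checkpoint
# (mean pooling over the token embeddings, same pooling as for the BERT baseline).
word_embedding_model = models.Transformer(
    "german-nlp-group/electra-base-german-uncased", max_seq_length=128
)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Placeholder German XNLI examples: label 0/1/2 = entailment/neutral/contradiction.
train_examples = [
    InputExample(texts=["Ein Mann liest ein Buch.", "Eine Person liest."], label=0),
    # ... the real XNLI training data would be loaded here ...
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.SoftmaxLoss(
    model=model,
    sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
    num_labels=3,
)

# STS-style dev set with gold similarity scores in [0, 1]; this evaluator reports
# the cosine Spearman correlation quoted above.
dev_examples = [
    InputExample(texts=["Ein Hund rennt.", "Ein Hund läuft schnell."], label=0.8),
]
dev_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(
    dev_examples, name="german-sts-dev"
)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=dev_evaluator,
    epochs=1,
    warmup_steps=100,
)
```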
Hi @PhilipMay, I think the sampling strategy for 1) is quite important. Out of the box, with round-robin sampling, the results are rather bad.
I would like to join, but I am afraid I do not really have time: I am currently working on several other research projects.
But I am looking forward to your experiences.
Hi @PhilipMay, I haven’t evaluated which is better: 1) multi-task training on NLI+STS, or 2) training first on NLI and then on STS.
This script does 1), while I personally use 2) in my experiments. But I never evaluated which approach is better: https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/other/training_multi-task.py
Edit: As the dataset sizes of NLI and STS are extremely different, I found that 2) makes a bit more sense. Otherwise, for 1), you need to figure out how to deal with the different dataset sizes. Currently it does round-robin sampling, i.e. NLI and STS each get 50% of the batches. But this can quickly lead to overfitting on the small STS training set, while for NLI you have only seen a fraction of the samples.
Out of the box, the NLI+STS multi-task setup leads to worse scores than approach 2), where you first fine-tune on NLI and then on STS.
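To illustrate the two approaches being compared, here is a rough sketch using the sentence-transformers fit API. Passing several train_objectives is what produces the round-robin sampling discussed above, while approach 2) is simply two consecutive fit calls; the dataloaders, losses, and epoch counts are placeholder assumptions, not a recommended configuration:

```python
from sentence_transformers import losses

# Assumed to exist: `model`, plus dataloaders built as in the earlier snippet
# (nli_dataloader with 0/1/2 labels, sts_dataloader with float similarity scores).
nli_loss = losses.SoftmaxLoss(
    model=model,
    sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
    num_labels=3,
)
sts_loss = losses.CosineSimilarityLoss(model=model)

# Approach 1): multi-task training. With multiple objectives, batches are drawn
# from each dataloader in round-robin fashion, so NLI and STS each contribute
# 50% of the updates regardless of how different the dataset sizes are.
model.fit(
    train_objectives=[(nli_dataloader, nli_loss), (sts_dataloader, sts_loss)],
    epochs=1,
    warmup_steps=100,
)

# Approach 2): sequential training - first NLI, then fine-tune on STS.
model.fit(train_objectives=[(nli_dataloader, nli_loss)], epochs=1, warmup_steps=100)
model.fit(train_objectives=[(sts_dataloader, sts_loss)], epochs=4, warmup_steps=100)
```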