Questions about distiluse-base-multilingual-cased and make_multilingual.py
Hi @nreimers,
Congratulations on the sentence-embeddings library. It’s quite useful, but I am a rookie in NLP and feel a bit overwhelmed, so apologies in advance if I ask silly questions.
I have several questions about the pre-trained multilingual models and the make_multilingual.py script.
Firstly, about the pre-trained multilingual models:

- How many languages can be used with distiluse-base-multilingual-cased, xlm-r-distilroberta-base-paraphrase-v1 and xlm-r-bert-base-nli-stsb-mean-tokens? I ask because the docs for distiluse-base-multilingual-cased (just as an example) say that “While the original mUSE model only supports 16 languages, this multilingual knowledge distilled version supports 50+ languages.” Do you mean that it can already be used for more than 50 languages, or that it supports the mUSE languages but can be extended to other languages using make_multilingual.py?
- If I want to fine-tune one of these pre-trained multilingual models, do I have to train the model on a new task? I am planning to use STSb to fine-tune distiluse-base-multilingual-cased (a rough sketch of what I mean is below this list), but I don’t know if it is a good idea.
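For context, this is roughly what I have in mind for the STSb fine-tuning. It is only a minimal sketch based on my understanding of the sentence-transformers training API (SentenceTransformer.fit with CosineSimilarityLoss); the two example pairs and their scores are made up rather than real STSb data:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Start from the pre-trained multilingual model
model = SentenceTransformer('distiluse-base-multilingual-cased')

# Made-up sentence pairs with similarity scores normalized to [0, 1];
# in practice these would be read from the STS benchmark files
train_examples = [
    InputExample(texts=['A man is playing a guitar.', 'A person plays guitar.'], label=0.9),
    InputExample(texts=['A woman is cooking.', 'A plane is taking off.'], label=0.05),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Regression on the cosine similarity of the two sentence embeddings
train_loss = losses.CosineSimilarityLoss(model=model)

model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1, warmup_steps=100)
```

Is something like this roughly the right approach, or would it hurt the model?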
Secondly, about make_multilingual.py:

- Can I use it to expand the languages covered by the above pre-trained multilingual models? (My current understanding of the script is sketched below this list.)
- Does the student have to be a multilingual model?
- When you use a multilingual model as the student model, does it lose its initial capabilities? For example, in the code you use xlm-roberta-base (which supports 100 languages) to imitate bert-base-nli-stsb-mean-tokens in 6 languages. Is the resulting model still useful for the initial tasks it could do over the 100 languages?
- Can I use this code for fine-tuning a pre-trained multilingual model?
- I have tried to run the code without changes in Google Colab using a GPU, and it raises a runtime error about the RAM used. Is this normal? (I mean, is this code meant to be run in Google Colab?)
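To make the distillation questions above more concrete, here is my current understanding of what make_multilingual.py sets up, written as a minimal sketch. The class names are the ones I believe the library provides (models.Transformer, models.Pooling, ParallelSentencesDataset, losses.MSELoss), and parallel-sentences.tsv is just a placeholder path for a file of tab-separated translation pairs; please correct me if I have misunderstood the procedure:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses
from sentence_transformers.datasets import ParallelSentencesDataset

# Teacher: the monolingual model whose embedding space should be imitated
teacher_model = SentenceTransformer('bert-base-nli-stsb-mean-tokens')

# Student: a multilingual transformer with mean pooling on top
word_embedding_model = models.Transformer('xlm-roberta-base')
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
student_model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Parallel data: each line holds a source sentence and its translation,
# separated by a tab (placeholder path, not a real file)
train_data = ParallelSentencesDataset(student_model=student_model, teacher_model=teacher_model)
train_data.load_data('parallel-sentences.tsv')
train_dataloader = DataLoader(train_data, shuffle=True, batch_size=64)

# The student is trained so that its embeddings of both the source sentence and
# the translation match the teacher's embedding of the source sentence (MSE)
train_loss = losses.MSELoss(model=student_model)

student_model.fit(train_objectives=[(train_dataloader, train_loss)],
                  epochs=5, warmup_steps=1000)
```

If this is correct, my question above is essentially whether the same recipe can be applied when the student is already one of the distilled multilingual models, and what happens to the languages that are not present in the parallel data.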
@nreimers Hi, do you mean the 53 languages listed here?

“We used the following languages for Multilingual Knowledge Distillation: ar, bg, ca, cs, da, de, el, es, et, fa, fi, fr, fr-ca, gl, gu, he, hi, hr, hu, hy, id, it, ja, ka, ko, ku, lt, lv, mk, mn, mr, ms, my, nb, nl, pl, pt, pt, pt-br, ro, ru, sk, sl, sq, sr, sv, th, tr, uk, ur, vi, zh-cn, zh-tw.”

This may be a silly question, sorry, but I don’t see English in that list. Does distiluse-base-multilingual-cased-v2 support English?

@nreimers Thank you! Sorry if that was too silly, I just wanted to make sure.