Questions about distiluse-base-multilingual-cased and make_multilingual.py
Hi @nreimers,
Congratulations on the sentence-embeddings library. It’s quite useful, but I am a rookie in NLP and feel a bit overwhelmed, so apologies in advance if I ask silly questions.
I have several questions about the pre-trained multilingual models and the make_multilingual.py script.
Firstly, about the pre-trained multilingual models:

- How many languages can be used with distiluse-base-multilingual-cased, xlm-r-distilroberta-base-paraphrase-v1 and xlm-r-bert-base-nli-stsb-mean-tokens? I ask because the docs for distiluse-base-multilingual-cased (just as an example) say that “While the original mUSE model only supports 16 languages, this multilingual knowledge distilled version supports 50+ languages.” Do you mean that it can already be used for more than 50 languages, or that it supports the mUSE languages but can be extended to other languages using make_multilingual.py?
- If I want to fine-tune one of these pre-trained multilingual models, do I have to train the model on a new task? I am planning to use STSb to fine-tune distiluse-base-multilingual-cased (a rough sketch of what I mean is below this list), but I don’t know if it is a good idea.
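For context, this is roughly what I have in mind for the STSb fine-tuning. It is only a minimal sketch based on my understanding of the sentence-transformers training API (SentenceTransformer.fit with CosineSimilarityLoss); the two example pairs and their scores are made up rather than real STSb data:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Start from the pre-trained multilingual model
model = SentenceTransformer('distiluse-base-multilingual-cased')

# Made-up sentence pairs with similarity scores normalized to [0, 1];
# in practice these would be read from the STS benchmark files
train_examples = [
    InputExample(texts=['A man is playing a guitar.', 'A person plays guitar.'], label=0.9),
    InputExample(texts=['A woman is cooking.', 'A plane is taking off.'], label=0.05),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Regression on the cosine similarity of the two sentence embeddings
train_loss = losses.CosineSimilarityLoss(model=model)

model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1, warmup_steps=100)
```

Is something like this roughly the right approach, or would it hurt the model?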
Secondly, about make_multilingual.py:

- Can I use it to expand the languages covered by the above pre-trained multilingual models? (My current understanding of the script is sketched below this list.)
- Does the student have to be a multilingual model?
- When you use a multilingual model as the student model, does it lose its initial capabilities? For example, in the code you use xlm-roberta-base (which supports 100 languages) to imitate bert-base-nli-stsb-mean-tokens in 6 languages. Is the resulting model still useful for the initial tasks it could do over the 100 languages?
- Can I use this code for fine-tuning a pre-trained multilingual model?
- I have tried to run the code without changes in Google Colab using a GPU, and it raises a runtime error about the RAM used. Is this normal? (I mean, is this code meant to be run in Google Colab?)
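To make the distillation questions above more concrete, here is my current understanding of what make_multilingual.py sets up, written as a minimal sketch. The class names are the ones I believe the library provides (models.Transformer, models.Pooling, ParallelSentencesDataset, losses.MSELoss), and parallel-sentences.tsv is just a placeholder path for a file of tab-separated translation pairs; please correct me if I have misunderstood the procedure:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses
from sentence_transformers.datasets import ParallelSentencesDataset

# Teacher: the monolingual model whose embedding space should be imitated
teacher_model = SentenceTransformer('bert-base-nli-stsb-mean-tokens')

# Student: a multilingual transformer with mean pooling on top
word_embedding_model = models.Transformer('xlm-roberta-base')
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
student_model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Parallel data: each line holds a source sentence and its translation,
# separated by a tab (placeholder path, not a real file)
train_data = ParallelSentencesDataset(student_model=student_model, teacher_model=teacher_model)
train_data.load_data('parallel-sentences.tsv')
train_dataloader = DataLoader(train_data, shuffle=True, batch_size=64)

# The student is trained so that its embeddings of both the source sentence and
# the translation match the teacher's embedding of the source sentence (MSE)
train_loss = losses.MSELoss(model=student_model)

student_model.fit(train_objectives=[(train_dataloader, train_loss)],
                  epochs=5, warmup_steps=1000)
```

If this is correct, my question above is essentially whether the same recipe can be applied when the student is already one of the distilled multilingual models, and what happens to the languages that are not present in the parallel data.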
@nreimers Hi, do you mean the 53 languages listed here?

“We used the following languages for Multilingual Knowledge Distillation: ar, bg, ca, cs, da, de, el, es, et, fa, fi, fr, fr-ca, gl, gu, he, hi, hr, hu, hy, id, it, ja, ka, ko, ku, lt, lv, mk, mn, mr, ms, my, nb, nl, pl, pt, pt, pt-br, ro, ru, sk, sl, sq, sr, sv, th, tr, uk, ur, vi, zh-cn, zh-tw.”

This may be a silly question, sorry, but I don’t see English in that list. Does distiluse-base-multilingual-cased-v2 support English?

@nreimers Thank you! Sorry if that was too silly, I just wanted to make sure.