Pretrained multilingual model for sentence embedding with a Max Sequence Length > 128

See original GitHub issue

Hi,

Is there any pretrained multilingual model (for sentence embedding) with a Max Sequence Length > 128 (e.g. 256 or 512)?

distiluse-base-multilingual-cased-v1, distiluse-base-multilingual-cased-v2, paraphrase-multilingual-MiniLM-L12-v2, and paraphrase-multilingual-mpnet-base-v2 all have a Max Sequence Length of 128 tokens.

Thanks
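
For reference, the limit each model reports can be checked directly. A minimal sketch, assuming the sentence-transformers package is installed, using the model names listed above:

```python
from sentence_transformers import SentenceTransformer

for name in [
    "distiluse-base-multilingual-cased-v1",
    "distiluse-base-multilingual-cased-v2",
    "paraphrase-multilingual-MiniLM-L12-v2",
    "paraphrase-multilingual-mpnet-base-v2",
]:
    model = SentenceTransformer(name)
    # max_seq_length is the token count beyond which inputs are truncated
    print(name, model.max_seq_length)  # 128 for all four models
```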

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

3 reactions
nreimers commented, Mar 22, 2022

You can try to use them with longer inputs, but the quality is unclear as they have not been trained for longer inputs.

Otherwise, we are currently working on multilingual models for longer inputs.
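
As the SBERT documentation linked in the next comment shows, the limit is a plain attribute that can be raised at runtime. A minimal sketch of that, with the caveat from the comment above that quality beyond 128 tokens is untested:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")
print(model.max_seq_length)  # 128 by default

# Raise the truncation limit. The underlying transformer typically accepts
# up to ~512 word pieces, but the model was never trained on inputs this
# long, so embedding quality is unclear.
model.max_seq_length = 256
embedding = model.encode("A long multilingual paragraph ...")
```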

2 reactions
nreimers commented, Mar 22, 2022

See: https://www.sbert.net/examples/applications/computing-embeddings/README.html#input-sequence-length

Averaging: This works if you have a few vectors and they are on the same topic. It doesn’t work for many vectors or if they are on different topics.
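
One hypothetical way to apply that averaging advice to a single long document: split it into chunks that fit the 128-token window, embed each chunk, and average the vectors. The whitespace chunking and the window size below are illustrative assumptions, not from the thread:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

def embed_long_text(text: str, window: int = 100) -> np.ndarray:
    """Embed each word-window separately, then average the vectors."""
    words = text.split()  # naive split; a real version should count tokens
    chunks = [" ".join(words[i:i + window]) for i in range(0, len(words), window)]
    vectors = model.encode(chunks)  # shape: (num_chunks, embedding_dim)
    # Per the comment above: averaging is only sensible for a few
    # vectors that are all on the same topic.
    return vectors.mean(axis=0)
```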

Read more comments on GitHub >

Top Results From Across the Web

Computing Sentence Embeddings
Input Sequence Length: with Transformer models like BERT / RoBERTa / DistilBERT etc., the runtime and the memory requirement grow quadratically with the input...
bert/README.md at master · google-research/bert - GitHub
Therefore, one good recipe is to pre-train for, say, 90,000 steps with a sequence length of 128 and then for 10,000 additional steps...
sentence-transformers/all-MiniLM-L6-v2 - Hugging Face
The project aims to train sentence embedding models on very large sentence level datasets using a self-supervised contrastive learning objective. We used the ...
arXiv:2007.01852v2 [cs.CL] 8 Mar 2022
We systematically investigate methods for learning multilingual sentence embeddings by combining the best methods for learning monolingual ...
Easily get high-quality embeddings with SentenceTransformers!
For example, the paraphrase-multilingual-mpnet-base-v2 model has a max sequence length of 128. This means for any input sentences with more than 128 tokens ...
