Pretrained multilingual model for sentence embedding with a Max Sequence Length > 128

See original GitHub issue

Hi,

Is there any pretrained multilingual model (for sentence embedding) with a Max Sequence Length > 128 (e.g. 256 or 512)?

distiluse-base-multilingual-cased-v1, distiluse-base-multilingual-cased-v2, paraphrase-multilingual-MiniLM-L12-v2, and paraphrase-multilingual-mpnet-base-v2 all have a Max Sequence Length of 128 tokens.

Thanks
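
For reference, the limit each model reports can be checked directly. A minimal sketch, assuming the sentence-transformers package is installed, using the model names listed above:

```python
from sentence_transformers import SentenceTransformer

for name in [
    "distiluse-base-multilingual-cased-v1",
    "distiluse-base-multilingual-cased-v2",
    "paraphrase-multilingual-MiniLM-L12-v2",
    "paraphrase-multilingual-mpnet-base-v2",
]:
    model = SentenceTransformer(name)
    # max_seq_length is the token count beyond which inputs are truncated
    print(name, model.max_seq_length)  # 128 for all four models
```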

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

3 reactions
nreimers commented, Mar 22, 2022

You can try to use them with longer inputs, but the quality is unclear as they have not been trained for longer inputs.

Otherwise, we are currently working on multilingual models for longer inputs.
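
As the SBERT documentation linked in the next comment shows, the limit is a plain attribute that can be raised at runtime. A minimal sketch of that, with the caveat from the comment above that quality beyond 128 tokens is untested:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")
print(model.max_seq_length)  # 128 by default

# Raise the truncation limit. The underlying transformer typically accepts
# up to ~512 word pieces, but the model was never trained on inputs this
# long, so embedding quality is unclear.
model.max_seq_length = 256
embedding = model.encode("A long multilingual paragraph ...")
```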

2 reactions
nreimers commented, Mar 22, 2022

See: https://www.sbert.net/examples/applications/computing-embeddings/README.html#input-sequence-length

Averaging: This works if you have a few vectors and they are on the same topic. It doesn’t work for many vectors or if they are on different topics.
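
One hypothetical way to apply that averaging advice to a single long document: split it into chunks that fit the 128-token window, embed each chunk, and average the vectors. The whitespace chunking and the window size below are illustrative assumptions, not from the thread:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

def embed_long_text(text: str, window: int = 100) -> np.ndarray:
    """Embed each word-window separately, then average the vectors."""
    words = text.split()  # naive split; a real version should count tokens
    chunks = [" ".join(words[i:i + window]) for i in range(0, len(words), window)]
    vectors = model.encode(chunks)  # shape: (num_chunks, embedding_dim)
    # Per the comment above: averaging is only sensible for a few
    # vectors that are all on the same topic.
    return vectors.mean(axis=0)
```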

Read more comments on GitHub >

Top Results From Across the Web

Computing Sentence Embeddings
Input Sequence Length: with Transformer models like BERT / RoBERTa / DistilBERT etc., the runtime and the memory requirement grow quadratically with the input...
bert/README.md at master · google-research/bert - GitHub
Therefore, one good recipe is to pre-train for, say, 90,000 steps with a sequence length of 128 and then for 10,000 additional steps...
sentence-transformers/all-MiniLM-L6-v2 - Hugging Face
The project aims to train sentence embedding models on very large sentence level datasets using a self-supervised contrastive learning objective. We used the ...
arXiv:2007.01852v2 [cs.CL] 8 Mar 2022
We systematically investigate methods for learning multilingual sentence embeddings by combining the best methods for learning monolingual ...
Easily get high-quality embeddings with SentenceTransformers!
For example, the paraphrase-multilingual-mpnet-base-v2 model has a max sequence length of 128. This means for any input sentences with more than 128 tokens ...
