Pretrained multilingual model for sentence embedding with a Max Sequence Length > 128
See original GitHub issue
Hi,
Is there any pretrained multilingual model (for sentence embedding) with a Max Sequence Length > 128 (e.g. 256 or 512)?
distiluse-base-multilingual-cased-v1, distiluse-base-multilingual-cased-v2, paraphrase-multilingual-MiniLM-L12-v2, and paraphrase-multilingual-mpnet-base-v2 all have a Max Sequence Length of 128 tokens.
Thanks
Issue Analytics
- Created: a year ago
- Comments: 7 (3 by maintainers)
You can try to use them with longer inputs, but the quality is unclear, as they have not been trained on longer inputs.
Otherwise, we are currently working on multilingual models for longer inputs.
See: https://www.sbert.net/examples/applications/computing-embeddings/README.html#input-sequence-length
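For reference, sentence-transformers exposes a max_seq_length attribute that controls where inputs are truncated, so the limit can be raised after loading. This is a minimal sketch of trying one of the models named above with a 256-token window; the quality caveat from the comment still applies, and the model name is just one of the multilingual models listed in the question.

```python
from sentence_transformers import SentenceTransformer

# One of the multilingual models from the question (trained with a 128-token window)
model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")
print(model.max_seq_length)  # 128 by default

# Raise the truncation limit. The underlying transformer supports longer position
# ranges, but the model was never trained on inputs this long, so embedding
# quality for long texts is unverified.
model.max_seq_length = 256

embeddings = model.encode(["A fairly long multilingual passage ..."])
print(embeddings.shape)
```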
Averaging: This works if you have a few vectors and they are on the same topic. It doesn’t work well for many vectors or if they cover different topics.
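As an illustration of that averaging strategy, here is a hedged sketch: split a long text into chunks that fit within the model's window, encode each chunk, and average the resulting vectors. The whitespace-based chunking and the chunk size are simplifying assumptions for the example; a real pipeline would chunk using the model's tokenizer.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

def embed_long_text(text: str, words_per_chunk: int = 100) -> np.ndarray:
    """Embed a long document by averaging chunk embeddings.

    Splitting on whitespace is only a rough proxy for the 128-token window,
    and averaging works best when all chunks stay on the same topic.
    """
    words = text.split()
    chunks = [
        " ".join(words[i : i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ] or [""]
    chunk_embeddings = model.encode(chunks)   # shape: (n_chunks, dim)
    return np.mean(chunk_embeddings, axis=0)  # shape: (dim,)

doc_embedding = embed_long_text("A very long multilingual document ...")
print(doc_embedding.shape)
```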