Preprocessing before using `model.encode(sentences)`
First, I'm a huge fan of this project and I use it very actively - thanks, guys!
Now to my questions.
1) Cased vs. uncased
Assume I'm running the following code:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('stsb-bert-base')

# Our sentences we like to encode
sentences = ['This framework generates embeddings for each input sentence',
             'Sentences are passed as a list of string.',
             'The quick brown fox jumps over the lazy dog.']

# Sentences are encoded by calling model.encode()
embeddings = model.encode(sentences)
Since the underlying pre-trained model is uncased but the sentences are cased, is this code doing some “uncasing” under the hood?
2) Text preprocessing
More generally, when I preprocess my sentences (lemmatize, remove stop words, remove special characters, ...), I tend to get different results compared to when I don't. As a rule of thumb, would you recommend that I preprocess the sentences myself (perhaps for consistency across methods), or should I leave this to the library's under-the-hood processing, on the grounds that one shouldn't, say, remove stop words?
Looking forward to your answers!
Issue Analytics
- Created: 2 years ago
- Comments: 6 (3 by maintainers)
Yes, for uncased models the text is lowercased before processing.
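You can check this yourself by tokenizing a mixed-case sentence and its lowercased form and comparing the output. A minimal sketch (it assumes the model.tokenizer attribute exposed by recent sentence-transformers versions):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('stsb-bert-base')

# For an uncased checkpoint the tokenizer lowercases internally, so both
# calls should print the same tokens, e.g. ['the', 'quick', 'brown', 'fox'].
print(model.tokenizer.tokenize('The Quick Brown Fox'))
print(model.tokenizer.tokenize('the quick brown fox'))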
No specific preprocessing is needed. Only when your sentences contain a lot of noise, such as URLs, can it make sense to remove it. Lemmatization and stop word removal are not needed.
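If your data does contain such noise, stripping URLs before encoding could look roughly like the following sketch (the clean_sentence helper and its regex are illustrative assumptions, not part of the library):

import re
from sentence_transformers import SentenceTransformer

def clean_sentence(text):
    # Drop URLs and collapse leftover whitespace; no lowercasing,
    # lemmatization or stop word removal is applied.
    text = re.sub(r'https?://\S+', ' ', text)
    return re.sub(r'\s+', ' ', text).strip()

model = SentenceTransformer('stsb-bert-base')
sentences = ['Check out https://www.sbert.net for the docs.',
             'Sentences are passed as a list of string.']
embeddings = model.encode([clean_sentence(s) for s in sentences])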
Thanks, this was really helpful!