Preprocessing before using `model.encode(sentences)`
First, I'm a huge fan of this project and I use it very actively - thanks, guys!
Now to my questions.
1) Cased vs. uncased
Assume I'm running the following code:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('stsb-bert-base')

# Our sentences we like to encode
sentences = ['This framework generates embeddings for each input sentence',
             'Sentences are passed as a list of string.',
             'The quick brown fox jumps over the lazy dog.']

# Sentences are encoded by calling model.encode()
embeddings = model.encode(sentences)
Since the underlying pre-trained model is uncased but the sentences are cased, is this code doing some “uncasing” under the hood?
2) Text preprocessing
More generally, when I preprocess my sentences (lemmatize, remove stop words, remove special characters, ...), I tend to get different results compared to when I don't. As a rule of thumb, would you recommend that I preprocess the sentences myself (perhaps for consistency across methods), or should I leave this to the library's under-the-hood processing, on the grounds that one shouldn't, say, remove stop words?
Looking forward to your answers!
Issue Analytics
- Created: 2 years ago
- Comments: 6 (3 by maintainers)
Yes, for uncased models the text is lowercased before processing.
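You can check this yourself by tokenizing a mixed-case sentence and its lowercased form and comparing the output. A minimal sketch (it assumes the model.tokenizer attribute exposed by recent sentence-transformers versions):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('stsb-bert-base')

# For an uncased checkpoint the tokenizer lowercases internally, so both
# calls should print the same tokens, e.g. ['the', 'quick', 'brown', 'fox'].
print(model.tokenizer.tokenize('The Quick Brown Fox'))
print(model.tokenizer.tokenize('the quick brown fox'))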
No specific preprocessing is needed. Only when your sentences contain a lot of noise, such as URLs, can it make sense to remove it. Lemmatization and stop word removal are not needed.
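If your data does contain such noise, stripping URLs before encoding could look roughly like the following sketch (the clean_sentence helper and its regex are illustrative assumptions, not part of the library):

import re
from sentence_transformers import SentenceTransformer

def clean_sentence(text):
    # Drop URLs and collapse leftover whitespace; no lowercasing,
    # lemmatization or stop word removal is applied.
    text = re.sub(r'https?://\S+', ' ', text)
    return re.sub(r'\s+', ' ', text).strip()

model = SentenceTransformer('stsb-bert-base')
sentences = ['Check out https://www.sbert.net for the docs.',
             'Sentences are passed as a list of string.']
embeddings = model.encode([clean_sentence(s) for s in sentences])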
Thanks, this was really helpful!