
Preprocessing before using `model.encode(sentences)`

See original GitHub issue

First, I’m a huge fan of this project and I use it very actively - thanks guys!

To my questions.

1) Cased vs. uncased

Assume I am running the following code:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('stsb-bert-base')

# Our sentences we would like to encode
sentences = ['This framework generates embeddings for each input sentence',
    'Sentences are passed as a list of string.',
    'The quick brown fox jumps over the lazy dog.']

# Sentences are encoded by calling model.encode()
embeddings = model.encode(sentences)

Since the underlying pre-trained model is uncased but the sentences are cased, is this code doing some “uncasing” under the hood?
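
For what it's worth, here is the kind of sanity check I had in mind (just a sketch; it assumes numpy is installed and simply compares the embeddings of a cased and a manually lowercased version of the same sentence):

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('stsb-bert-base')

# Encode a cased sentence and a manually lowercased copy of it
cased = 'The Quick Brown Fox Jumps Over The Lazy Dog.'
emb_cased, emb_lower = model.encode([cased, cased.lower()])

# If the model uncases internally, the two embeddings should be (near-)identical
cos = np.dot(emb_cased, emb_lower) / (np.linalg.norm(emb_cased) * np.linalg.norm(emb_lower))
print(f'Cosine similarity, cased vs. lowercased: {cos:.4f}')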

2) Text preprocessing

More generally, when I preprocess my sentences (lemmatize, remove stop words, remove special characters, …), I tend to get different results compared to when I don't. As a rule of thumb, would you recommend that I preprocess the sentences myself (perhaps for consistency across methods), or should I leave this to your under-the-hood preprocessing, on the grounds that one shouldn't, say, remove stop words?
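
To make the second question concrete, this is roughly the comparison I am running (a deliberately naive sketch: the stop-word list and regex below are just placeholders for my actual preprocessing, which also lemmatizes):

import re
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('stsb-bert-base')

def naive_preprocess(text):
    # Placeholder preprocessing: lowercase, strip special characters, drop a few stop words
    stop_words = {'a', 'an', 'the', 'of', 'is', 'are', 'as', 'for'}
    text = re.sub(r'[^a-z0-9\s]', ' ', text.lower())
    return ' '.join(t for t in text.split() if t not in stop_words)

raw = 'Sentences are passed as a list of string.'
emb_raw, emb_clean = model.encode([raw, naive_preprocess(raw)])

# How much does the preprocessing move the embedding?
cos = np.dot(emb_raw, emb_clean) / (np.linalg.norm(emb_raw) * np.linalg.norm(emb_clean))
print(f'Cosine similarity, raw vs. preprocessed: {cos:.4f}')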

Looking forward to your answers!

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

2 reactions
nreimers commented, Apr 2, 2021
  1. Yes, the text is lowercased before processing for uncased models.

  2. Specific preprocessing is not needed. Only when your sentences contain a lot of noise, such as URLs, can it make sense to remove it. Lemmatization and stop-word removal are not needed.
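
A minimal sketch of the kind of noise removal meant in point 2 (the URL regex here is only illustrative, not an official recipe):

import re
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('stsb-bert-base')

def strip_urls(text):
    # Very rough URL removal; adapt the pattern to your own data
    return re.sub(r'https?://\S+', ' ', text).strip()

noisy_sentences = ['Check the docs at https://www.sbert.net for more examples.',
    'The quick brown fox jumps over the lazy dog.']

# No lemmatization or stop-word removal; only obvious noise is stripped before encoding
embeddings = model.encode([strip_urls(s) for s in noisy_sentences])
print(embeddings.shape)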

0 reactions
muhlbach commented, Apr 8, 2021

Thanks, this was really helpful!

Read more comments on GitHub >

Top Results From Across the Web

Preprocess - Hugging Face
The main tool for preprocessing textual data is a tokenizer. A tokenizer splits text into tokens according to a set of rules. The...

A Guide to Text Preprocessing Using BERT
This blog discusses how to use SOTA BERT for pre-processing textual data ... There is a preprocessing model for each BERT...

Preprocessing NLP - Tutorial to quickly clean up a text
Today I share with you this NLP preprocessing tutorial to see in detail how to efficiently clean up your text data!

Making BERT Easier with Preprocessing Models From ...
The preprocessing computation can be run asynchronously on a dataset using tf.data.

Preprocess · Issue #163 · UKPLab/sentence-transformers
But be careful. The different models use different tokenization strategies and have different vocabs. So if you pre-process your dataset with ...
