
Hello,

First of all, thank you for the great paper and package, both are really great!

Sorry if these are simple questions; I only started using BERT recently.

I'm using a custom triplet dataset in Portuguese to fine-tune a BERT model (also in Portuguese, available on Hugging Face), and I have a question about preprocessing. Based on your script training_wikipedia_sections.py, when I use the DataLoader, does it preprocess the texts in my dataset, or should they already be preprocessed? Or is there another function for that?

Also, I'm trying many variations during training, and it takes almost an hour to load the data every time I run the script. Is there a way to save the data and load it later without going through the DataLoader again?

One more thing: is it possible to fine-tune BERT on the triplet dataset and get word embeddings instead of sentence embeddings? Maybe by removing the pooling layer. I would like to compare the two.

Thank you!

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

2 reactions · nreimers commented on Mar 20, 2020

Hi, I just committed an option to return the token embeddings:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('bert-base-nli-mean-tokens')
sentences = ['This framework generates embeddings for each input sentence',
             'Sentences are passed as a list of string.',
             'The quick brown fox jumps over the lazy dog.']

sentence_embeddings = model.encode(sentences, output_value='token_embeddings')
print(sentence_embeddings[1])  # Token embeddings for the second sentence; includes padding tokens, which are all zero

I hope this helps.

You must use the version from this repository, as the pip package is not yet updated. It will be part of version 0.2.6.
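Since the returned token embeddings include all-zero vectors at padding positions (as the comment in the snippet above notes), one way to drop them is to filter out the all-zero vectors. A minimal plain-Python sketch (the 2-dimensional vectors here are made up for illustration; real BERT token embeddings are 768-dimensional arrays):

```python
def strip_padding(token_embeddings):
    """Keep only vectors that are not all zero, i.e. real tokens rather than padding."""
    return [vec for vec in token_embeddings if any(v != 0.0 for v in vec)]

# Hypothetical 4-position output where the last two positions are padding.
padded = [[0.1, -0.2], [0.3, 0.4], [0.0, 0.0], [0.0, 0.0]]
print(strip_padding(padded))  # -> [[0.1, -0.2], [0.3, 0.4]]
```

Note this is only a sketch: it assumes a genuine token embedding is never exactly the zero vector, which holds in practice.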

0 reactions · nreimers commented on Mar 29, 2020

Hi @mariana-yn, yes, padding is applied per mini-batch: the system finds the longest sentence in a mini-batch and pads all sentences in that mini-batch to the same length.
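The per-mini-batch padding described above can be sketched like this (an illustration of the idea, not the library's actual code; the token IDs are made up):

```python
def pad_batch(batch_of_token_ids, pad_id=0):
    """Pad every sequence in the mini-batch to the length of the longest one."""
    max_len = max(len(ids) for ids in batch_of_token_ids)
    return [ids + [pad_id] * (max_len - len(ids)) for ids in batch_of_token_ids]

batch = [[101, 7592, 2088, 102], [101, 7592, 102]]
print(pad_batch(batch))  # -> [[101, 7592, 2088, 102], [101, 7592, 102, 0]]
```

Because the padding length is recomputed per batch, a batch of short sentences wastes no compute on padding up to a global maximum.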

There is also a cut-off at 128 word pieces; sentences longer than that will be truncated. BERT's memory and runtime grow quadratically with sentence length, so long sentences consume a lot of memory and are quite slow.

The maximum sentence length can be configured on the model, for example on the underlying BERT model.

Out of the box, no code is included to map a text to word pieces, but it is easy to get this information: you can map the token IDs back to their entries in the tokenizer's vocabulary.
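As an illustration of mapping IDs back through the vocabulary (the miniature vocab here is made up; a real BERT tokenizer has roughly 30k entries and exposes this lookup directly, e.g. via tokenizer.convert_ids_to_tokens):

```python
# Hypothetical miniature vocabulary mapping token IDs to word pieces.
vocab = {101: "[CLS]", 7592: "hello", 102: "[SEP]"}

ids = [101, 7592, 102]
tokens = [vocab[i] for i in ids]
print(tokens)  # -> ['[CLS]', 'hello', '[SEP]']
```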

Or you can use this code:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained(model_name_or_path, do_lower_case=True)
tokenizer.tokenize(text)

Best, Nils Reimers
