
Hello,

First of all, thank you for the great paper and package, both are really great!

Sorry if these are simple questions; I only started using BERT recently.

I'm using a custom triplet dataset in Portuguese to fine-tune a BERT model (also in Portuguese, available on Hugging Face), and I have a question about preprocessing. Based on your script training_wikipedia_sections.py, when I use the DataLoader, does it preprocess the texts in my dataset, or should they already be preprocessed? Or is there another function for that?

Also, I'm trying many variations during training, and it takes almost an hour to load the data every time I run the script. Is there a way to save the data and load it later without going through the DataLoader again?

One more thing: is it possible to fine-tune BERT on the triplet dataset and get word embeddings instead of sentence embeddings? Maybe by removing the pooling layer. I would like to compare the two.

Thank you!

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

2 reactions · nreimers commented on Mar 20, 2020

Hi, I just committed an option to return the token embeddings:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('bert-base-nli-mean-tokens')
sentences = ['This framework generates embeddings for each input sentence',
             'Sentences are passed as a list of string.',
             'The quick brown fox jumps over the lazy dog.']

sentence_embeddings = model.encode(sentences, output_value='token_embeddings')
print(sentence_embeddings[1])  # Token embeddings for the second sentence; includes padding tokens, which are all zero

I hope this helps.

You must use the version from this repository, as the pip package is not yet updated. It will be part of version 0.2.6.
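Since the returned token embeddings include all-zero vectors at padding positions (as the comment in the snippet above notes), one way to drop them is to filter out the all-zero vectors. A minimal plain-Python sketch (the 2-dimensional vectors here are made up for illustration; real BERT token embeddings are 768-dimensional arrays):

```python
def strip_padding(token_embeddings):
    """Keep only vectors that are not all zero, i.e. real tokens rather than padding."""
    return [vec for vec in token_embeddings if any(v != 0.0 for v in vec)]

# Hypothetical 4-position output where the last two positions are padding.
padded = [[0.1, -0.2], [0.3, 0.4], [0.0, 0.0], [0.0, 0.0]]
print(strip_padding(padded))  # -> [[0.1, -0.2], [0.3, 0.4]]
```

Note this is only a sketch: it assumes a genuine token embedding is never exactly the zero vector, which holds in practice.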

0 reactions · nreimers commented on Mar 29, 2020

Hi @mariana-yn, yes, padding is applied per mini-batch: the system finds the longest sentence in a mini-batch and pads all sentences in that mini-batch to the same length.
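The per-mini-batch padding described above can be sketched like this (an illustration of the idea, not the library's actual code; the token IDs are made up):

```python
def pad_batch(batch_of_token_ids, pad_id=0):
    """Pad every sequence in the mini-batch to the length of the longest one."""
    max_len = max(len(ids) for ids in batch_of_token_ids)
    return [ids + [pad_id] * (max_len - len(ids)) for ids in batch_of_token_ids]

batch = [[101, 7592, 2088, 102], [101, 7592, 102]]
print(pad_batch(batch))  # -> [[101, 7592, 2088, 102], [101, 7592, 102, 0]]
```

Because the padding length is recomputed per batch, a batch of short sentences wastes no compute on padding up to a global maximum.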

There is also a cut-off at 128 word pieces; sentences longer than that will be truncated. BERT's memory and runtime grow quadratically with sentence length, so long sentences consume a lot of memory and are quite slow.

The maximum sentence length can be configured on the model, for example on the underlying BERT model.

Out of the box, no code is included to map a text to word pieces, but it is easy to get this information: you can map the token IDs back to their entries in the tokenizer's vocabulary.
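As an illustration of mapping IDs back through the vocabulary (the miniature vocab here is made up; a real BERT tokenizer has roughly 30k entries and exposes this lookup directly, e.g. via tokenizer.convert_ids_to_tokens):

```python
# Hypothetical miniature vocabulary mapping token IDs to word pieces.
vocab = {101: "[CLS]", 7592: "hello", 102: "[SEP]"}

ids = [101, 7592, 102]
tokens = [vocab[i] for i in ids]
print(tokens)  # -> ['[CLS]', 'hello', '[SEP]']
```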

Or you can use this code:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained(model_name_or_path, do_lower_case=True)
tokenizer.tokenize(text)

Best, Nils Reimers
