Preprocess
Hello,
First of all, thank you for the great paper and package, both are really great!
Sorry if those are simple questions, but I started using BERT recently.
So, I’m using a custom triplets dataset in Portuguese to fine-tune a BERT model (also in Portuguese, available through Hugging Face), and I have a question about the preprocessing.
Based on your script training_wikipedia_sections.py, when I use the DataLoader function, does it preprocess the texts contained in my datasets, or should they already be preprocessed? Or is there another function to do that?
Also, I’m trying many variations during training, and it takes almost 1 hour to load the data every time I run the script. Is there a way to save the data and load it later without using the DataLoader?
Just one more thing: is it possible to fine-tune BERT using the triplets dataset and get word embeddings instead of sentence embeddings? Maybe by removing the pooling layer… I would like to compare both.
Thank you!
Issue Analytics
- Created 4 years ago
- Comments: 5 (3 by maintainers)
Hi, I just committed an option to return the token embeddings:
I hope this helps.
You must use the version from this repository, as the pip package is not yet updated. It will be part of version 0.2.6.
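To make the difference between token embeddings and a sentence embedding concrete, here is a toy sketch in plain Python. It assumes mean pooling (one of several possible pooling modes) and uses made-up 2-dimensional vectors; the helper name `mean_pool` is hypothetical and is not the library's API.

```python
from typing import List

Vector = List[float]


def mean_pool(token_embeddings: List[Vector]) -> Vector:
    """Average the per-token vectors into a single sentence vector.

    Assumption: mean pooling without padding tokens. A real pooling
    layer would also use the attention mask to skip padded positions.
    """
    dim = len(token_embeddings[0])
    count = len(token_embeddings)
    return [sum(vec[i] for vec in token_embeddings) / count for i in range(dim)]


# Made-up embeddings for a three-token sentence (one vector per token).
token_embeddings = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

# With pooling: one vector for the whole sentence.
sentence_embedding = mean_pool(token_embeddings)

# Without pooling: the per-token vectors are returned unchanged.
print(len(token_embeddings), "token vectors ->", sentence_embedding)
```

Skipping the pooling step keeps one vector per word piece, which is what you would compare against the pooled sentence vector.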
Hi @mariana-yn, yes, padding is applied per mini-batch. The system finds the longest sentence in a mini-batch and then pads all other sentences in that mini-batch to the same length.
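As a rough illustration of per-mini-batch padding, here is a minimal sketch in plain Python. The function `pad_batch`, the padding ID `0`, and the example token IDs are assumptions for illustration only, not the library's actual implementation.

```python
from typing import List, Tuple

PAD_ID = 0  # assumed padding ID; real tokenizers define their own


def pad_batch(batch: List[List[int]], pad_id: int = PAD_ID) -> Tuple[List[List[int]], List[List[int]]]:
    """Pad every sequence in one mini-batch to the longest sequence in that batch."""
    longest = max(len(seq) for seq in batch)
    padded = [seq + [pad_id] * (longest - len(seq)) for seq in batch]
    # The mask marks real tokens with 1 and padding positions with 0.
    mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in batch]
    return padded, mask


# Two mini-batches with different maximum lengths (made-up token IDs).
batch_a = [[101, 7592, 102], [101, 7592, 2088, 999, 102]]
batch_b = [[101, 102], [101, 2023, 102]]

padded_a, mask_a = pad_batch(batch_a)  # padded to length 5
padded_b, mask_b = pad_batch(batch_b)  # padded to length 3
```

Because each mini-batch is padded only to its own longest sentence, batches of short sentences stay short instead of being padded to a global maximum.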
There is also a cut-off at 128 word pieces; sentences longer than that will be truncated. BERT's memory and runtime grow quadratically with the sentence length, so if sentences get too long, it will consume a lot of memory and will be quite slow.
The max sentence length can be configured in the model, for example, in the BERT model.
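The cut-off and the quadratic growth can be sketched with a little arithmetic. This is a simplified illustration: the helper names are hypothetical, the parameter name `max_seq_length` is used here only for illustration, and real truncation also has to reserve room for special tokens such as [CLS] and [SEP].

```python
from typing import List


def truncate(word_piece_ids: List[int], max_seq_length: int = 128) -> List[int]:
    """Keep at most max_seq_length word pieces (simplified: ignores special tokens)."""
    return word_piece_ids[:max_seq_length]


def attention_entries(seq_length: int) -> int:
    """Size of one self-attention score matrix: every position attends to every position."""
    return seq_length * seq_length


long_input = list(range(300))      # a made-up input of 300 word-piece IDs
truncated = truncate(long_input)   # cut down to 128 word pieces

cost_128 = attention_entries(128)
cost_256 = attention_entries(256)
print(len(truncated), cost_128, cost_256, cost_256 // cost_128)
```

Doubling the maximum length from 128 to 256 quadruples the number of attention scores per head and layer, which is why raising the limit gets expensive quickly.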
Out of the box, no code is included to map a text to its word pieces. But it would be quite easy to get this information: you can map the IDs back to their respective entries in the vocab of the tokenizer.
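Mapping IDs back to vocab entries could look roughly like the sketch below. The vocabulary is a made-up toy and `ids_to_word_pieces` is a hypothetical helper; with a Hugging Face tokenizer, the `convert_ids_to_tokens` method serves the same purpose.

```python
from typing import Dict, List

# Toy vocabulary (token -> ID), standing in for the tokenizer's real vocab.
toy_vocab: Dict[str, int] = {
    "[PAD]": 0,
    "[CLS]": 1,
    "[SEP]": 2,
    "pre": 3,
    "##process": 4,
    "text": 5,
}

# Invert the vocab once so IDs can be looked up directly.
id_to_token: Dict[int, str] = {idx: token for token, idx in toy_vocab.items()}


def ids_to_word_pieces(ids: List[int]) -> List[str]:
    """Map each input ID back to its word-piece string."""
    return [id_to_token[i] for i in ids]


word_pieces = ids_to_word_pieces([1, 3, 4, 5, 2])
print(word_pieces)
```

Each token embedding at position i then corresponds to the word piece at position i of this list.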
Or you can use this code:
Best,
Nils Reimers