❓ Adding new tokens to a pre-trained tokenizer
Hi, I am working with the DistilBERT multilingual model for sequence classification tasks, where I need to support some additional languages beyond those mentioned here. For that, I am struggling to find the correct way to update the tokenizer. From the documentation, I inferred that I first have to collect all the new tokens in a list, call tokenizer.add_tokens(), and then pass the new sentences to the tokenizer again to get them tokenized. So the real question: is there any method I can use to update the tokenizer and tokenize a sentence at the same time (i.e., when the tokenizer sees an unknown token, it adds that token to the vocabulary)? Thanks in advance.
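For reference, the two-step workflow the question describes looks roughly like this with the Hugging Face transformers API; this is a minimal sketch, and the checkpoint name and new tokens below are illustrative, not taken from the original issue:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load a pretrained multilingual checkpoint (name is illustrative).
checkpoint = "distilbert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# Step 1: collect the new tokens in a list and add them to the tokenizer.
new_tokens = ["sampletokena", "sampletokenb"]  # hypothetical tokens
num_added = tokenizer.add_tokens(new_tokens)

# Step 2: resize the model's embedding matrix to cover the added tokens.
model.resize_token_embeddings(len(tokenizer))

# Step 3: re-tokenize; the new tokens now map to real ids instead of [UNK].
ids = tokenizer("a sentence with sampletokena", add_special_tokens=False)["input_ids"]
```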
Issue Analytics
- Created: 3 years ago
- Comments: 5 (3 by maintainers)
Top Results From Across the Web
- Adding new tokens while preserving tokenization of adjacent tokens: "I'm trying to add some new tokens to BERT and RoBERTa tokenizers so that I can fine-tune the models on a new word..."
- NLP | How to add a domain-specific vocabulary (new tokens) to a subword tokenizer already trained like BERT WordPiece
- Adding new tokens to BERT/RoBERTa while retaining tokenization: "If you want to add new tokens to fine-tune a RoBERTa-based model, consider training your tokenizer on your corpus."
- How to add new tokens to huggingface transformers vocabulary: "First, we need to define and load the transformer model from huggingface. Now we can use the add_tokens method of the tokenizer..."
- Adding a new token to a transformer model without breaking tokenization: "from transformers import BertTokenizer, BertForMaskedLM new_words = ['myword1', 'myword2'] model = BertForMaskedLM.from_pretrained('bert-base- ..."
Top GitHub Comments
Or you can map all such tokens (or groups of them) to a single OOV-style token instead.
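A sketch of that idea, assuming the Hugging Face transformers API; the placeholder token name and the helper function are made up for illustration:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")

# Register a single OOV-style placeholder token (the name is made up).
placeholder = "[RARE]"
tokenizer.add_tokens([placeholder], special_tokens=True)

def map_rare_words(sentence, known_words):
    """Replace every whitespace-separated word outside `known_words`
    with the shared placeholder before encoding."""
    return " ".join(w if w in known_words else placeholder for w in sentence.split())
```

If the model is going to see the placeholder, its embedding matrix needs resizing too, as the next comment points out.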
There is no way to dynamically add unknown tokens to the vocabulary. The simplest way to do it would be to encode the sequence, detect unknowns, and then add these to the vocabulary, which seems to be what you did!
Please be aware that you will have to resize the model’s embedding matrix according to the tokens you’ve added.
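A rough sketch of that encode-detect-add-resize loop, again assuming the Hugging Face API; the helper function is hypothetical:

```python
def grow_vocab(tokenizer, model, sentence):
    """Hypothetical helper: encode a sentence, collect the words that hit
    the unknown token, add them to the vocabulary, and resize the model's
    embedding matrix to match."""
    unknowns = [
        word
        for word in sentence.split()
        if tokenizer.unk_token_id
        in tokenizer(word, add_special_tokens=False)["input_ids"]
    ]
    if unknowns:
        tokenizer.add_tokens(unknowns)
        # Keep the embedding matrix in sync with the enlarged vocabulary.
        model.resize_token_embeddings(len(tokenizer))
    return unknowns
```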