
Adding new tokens to pre-trained tokenizer

See original GitHub issue

Details

Hi, I am working with the multilingual DistilBERT model on sequence classification tasks, and I need to support some additional languages beyond those mentioned here. For that, I am struggling to find the correct way to update the tokenizer. From the documentation, I inferred that I first have to collect all the new tokens in a list, call tokenizer.add_tokens(), and then pass the sentences through the tokenizer again to get them tokenized. So the real question: is there a method I can use to update the tokenizer and tokenize a sentence at the same time, so that when the tokenizer sees an unknown token it adds that token to the vocabulary? Thanks in advance.
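For reference, a minimal sketch of the two-step flow described in the question, using the transformers API; the checkpoint matches the issue's setup, while the example tokens and sentence are made up:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")

# Step 1: collect the new tokens in a list and register them once
new_tokens = ["newword1", "newword2"]  # hypothetical tokens from an unsupported language
num_added = tokenizer.add_tokens(new_tokens)
print(f"added {num_added} tokens, vocab size is now {len(tokenizer)}")

# Step 2: run the sentences through the tokenizer again;
# the added tokens now survive as single pieces instead of [UNK]
print(tokenizer.tokenize("a sentence containing newword1"))
```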

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
AdityaSoni19031997 commented, Apr 10, 2020

Or you can map all such tokens (or maybe a group of them) to an OOV-kind token as well.
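A rough sketch of that suggestion, assuming a single made-up catch-all token [RARE] instead of one vocabulary entry per unseen word (the collapse_unknowns helper is hypothetical):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")

# Register one catch-all token instead of growing the vocabulary per word
tokenizer.add_special_tokens({"additional_special_tokens": ["[RARE]"]})

def collapse_unknowns(text: str) -> str:
    # Replace every whitespace-separated word that encodes to [UNK]
    # with the catch-all token before tokenizing for real
    unk_id = tokenizer.unk_token_id
    return " ".join(
        "[RARE]" if unk_id in tokenizer(w, add_special_tokens=False)["input_ids"] else w
        for w in text.split()
    )
```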

1 reaction
LysandreJik commented, Apr 10, 2020

There is no way to dynamically add unknown tokens to the vocabulary. The simplest way to do it would be to encode the sequence, detect unknowns, and then add these to the vocabulary, which seems to be what you did!

Please be aware that you will have to resize the model’s embedding matrix according to the tokens you’ve added.
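Putting the two comments together, a plausible sketch of that encode, detect, add, resize flow (the sequence classification checkpoint mirrors the question; the corpus and num_labels are placeholders):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

sentences = ["a placeholder sentence from the new language"]  # your corpus here

# Encode the sequences and collect the words the tokenizer cannot represent
unk_id = tokenizer.unk_token_id
unknowns = {
    word
    for sentence in sentences
    for word in sentence.split()
    if unk_id in tokenizer(word, add_special_tokens=False)["input_ids"]
}

# Add them to the vocabulary, then resize the embedding matrix to match
tokenizer.add_tokens(sorted(unknowns))
model.resize_token_embeddings(len(tokenizer))
```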

Read more comments on GitHub >

Top Results From Across the Web

  • Adding new tokens while preserving tokenization of adjacent ...
    I'm trying to add some new tokens to BERT and RoBERTa tokenizers so that I can fine-tune the models on a new word...
  • NLP | How to add a domain-specific vocabulary (new tokens) to a subword tokenizer already trained like BERT WordPiece
  • Adding new tokens to BERT/RoBERTa while retaining ...
    If you want to add new tokens to fine-tune a Roberta-based model, consider training your tokenizer on your corpus.
  • How to add new tokens to huggingface transformers vocabulary
    First, we need to define and load the transformer model from huggingface. ... Now we can use the add_tokens method of the tokenizer...
  • Adding a new token to a transformer model without breaking ...
    from transformers import BertTokenizer, BertForMaskedLM new_words = ['myword1', 'myword2'] model = BertForMaskedLM.from_pretrained('bert-base- ...
    (this snippet is truncated at the source; a self-contained completion follows this list)
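The last snippet above is cut off mid-checkpoint name; a self-contained version of the same pattern might look like this (the bert-base-uncased checkpoint is an assumed completion, and the new_words values are the snippet's own placeholders):

```python
from transformers import BertTokenizer, BertForMaskedLM

new_words = ["myword1", "myword2"]  # placeholder words from the snippet

# "bert-base-uncased" is assumed; the original checkpoint name is truncated
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Register the words and grow the embedding matrix to the new vocab size
tokenizer.add_tokens(new_words)
model.resize_token_embeddings(len(tokenizer))

# The new words now tokenize as single tokens instead of subword pieces
print(tokenizer.tokenize("a sentence with myword1 in it"))
```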
