
Adding special tokens to the model


Hello, I am trying to use model.tokenizer.add_special_tokens(special_tokens_dict) to add some special tokens to the model. After doing that, I get an indexing error (IndexError: index out of range in self) when I try to encode a sentence. How can I learn the vector representations of the new tokens? Something like model.resize_token_embeddings(len(t))?
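The IndexError above can be reproduced without any model at all: an embedding layer is essentially a lookup table with one row per vocabulary entry, so the ids assigned to newly added tokens point past its last row until the table is resized. A minimal sketch of that failure mode (plain Python with a toy vocabulary size, not the actual transformer internals):

```python
# Toy embedding table: one row per original vocabulary entry.
# (A real model like bert-base-uncased has 30522 rows; 8 keeps this small.)
vocab_size = 8
embedding_table = [[0.0] * 4 for _ in range(vocab_size)]  # 4-dim toy embeddings

new_token_id = vocab_size  # the first id handed out to an added token
try:
    embedding_table[new_token_id]  # same failure mode as the model forward pass
except IndexError:
    print("IndexError: the new token id has no embedding row yet")

# Resizing the table (what resize_token_embeddings does) adds rows for new ids:
embedding_table.append([0.0] * 4)
row = embedding_table[new_token_id]  # now succeeds
```

This is why adding tokens to the tokenizer alone is not enough: the model's embedding matrix must grow to match the new vocabulary size.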

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

2 reactions
nreimers commented on Feb 5, 2021

You can use this code:

tokens = ["TOK1", "TOK2"]
word_embedding_model = model._first_module()   # Your models.Transformer object
word_embedding_model.tokenizer.add_tokens(tokens, special_tokens=True)
word_embedding_model.auto_model.resize_token_embeddings(len(word_embedding_model.tokenizer))
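To see what resize_token_embeddings accomplishes conceptually, here is a hedged sketch of the resizing step using plain lists instead of tensors (the function name and initialization here are illustrative, not the actual Hugging Face implementation):

```python
import random

def resize_embeddings(table, new_size, dim=4):
    """Grow (or truncate) an embedding table to new_size rows.

    Existing rows are kept unchanged; new rows get a fresh random
    initialization. This is why newly added tokens start out with
    untrained vectors that only become meaningful after fine-tuning.
    """
    if new_size <= len(table):
        return table[:new_size]
    new_rows = [[random.gauss(0.0, 0.02) for _ in range(dim)]
                for _ in range(new_size - len(table))]
    return table + new_rows

old_table = [[1.0] * 4 for _ in range(10)]    # pretend vocab of 10 tokens
new_table = resize_embeddings(old_table, 12)  # two tokens were added
```

After resizing, encoding sentences containing the new tokens no longer raises an IndexError, but the new rows still need to be trained (or fine-tuned) before their representations are useful.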

1 reaction
nreimers commented on Mar 26, 2021

Yes, it is correct.


Top Results From Across the Web

How to add some new special tokens to a pretrained tokenizer?
Hi guys. I want to add some new special tokens like [XXX] to a pretrained ByteLevelBPETokenizer, but I can't find how to do...
Utilities for Tokenizers - Hugging Face
The model input with special tokens. Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating...
How to add new special token to the tokenizer? - Stack Overflow
I want to build a multi-class classification model for which I have conversational data as input for the BERT model ...
How to add new tokens to huggingface transformers vocabulary
In this short article, you'll learn how to add new tokens to the vocabulary of a huggingface transformer model.
Adding a new token to a transformer model without breaking ...
add_tokens (new_words) model.resize_token_embeddings(len(tokenizer)) tokenizer.tokenize('myword1 myword2') # result: ['myword1', 'myword2 ...
