
Hugging Face GPT-2 Tokenizer

See original GitHub issue

Hello,

I know that if I add a new “special token” to the pre-made GPT-2 tokenizer, and I want to use the pre-trained GPT-2 model for my analysis, I will need to further train the pre-trained GPT-2 so that the model learns the new special token.

But what if I just add an extra non-special token? For example, the word “paradox” is not included in the existing GPT-2 tokenizer, so say I add “paradox” to the existing GPT-2 vocabulary, like below:

# load the pre-trained GPT-2 tokenizer and double-heads model
from transformers import GPT2Tokenizer, GPT2DoubleHeadsModel

gpt2_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# add a new word (not a special token) to the existing vocabulary;
# the pre-assigned special tokens are left unchanged
gpt2_tokenizer.add_tokens(["paradox"])

# get the pre-trained Hugging Face GPT2DoubleHeadsModel
model_gpt2DoubleHeadsModel = GPT2DoubleHeadsModel.from_pretrained('gpt2', output_hidden_states=True)

# resize the token embeddings to match the enlarged tokenizer
# (I am not sure exactly what this function does)
model_gpt2DoubleHeadsModel.resize_token_embeddings(len(gpt2_tokenizer))

Given that I didn’t make any changes to the special tokens in the GPT-2 tokenizer, do I still need to train the already pre-trained GPT2DoubleHeadsModel before I start using it, just because I added a new word to the vocabulary?

Thank you,
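
For context on the resize_token_embeddings call in the snippet above: it grows the model’s input (and tied output) token-embedding matrix to match the new tokenizer length, and any newly added rows start out randomly initialized. Here is a minimal sketch of how one could check this, assuming the standard transformers and PyTorch APIs; the sizes in the comments are what you would expect for the stock 'gpt2' checkpoint:

# sketch: inspect the embedding matrix before and after resizing
from transformers import GPT2Tokenizer, GPT2DoubleHeadsModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2DoubleHeadsModel.from_pretrained('gpt2')

print(len(tokenizer))                             # 50257 for the stock GPT-2 vocabulary
print(model.get_input_embeddings().weight.shape)  # torch.Size([50257, 768])

tokenizer.add_tokens(["paradox"])                 # returns 1 if "paradox" was not already a token
model.resize_token_embeddings(len(tokenizer))

print(len(tokenizer))                             # 50258
print(model.get_input_embeddings().weight.shape)  # torch.Size([50258, 768]); the new row is randomly initialized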

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
LysandreJik commented, Jun 3, 2020

You would need to fine-tune your GPT-2 model on a dataset containing the word, yes. The reason is that the model needs to learn in which contexts the word is used, what it means, and so on.
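
Building on this answer, here is a minimal, illustrative fine-tuning sketch (not from the thread): after adding the token and resizing the embeddings, continue ordinary language-model training on text that actually contains the new word, so that its freshly initialized embedding row receives gradients. The toy corpus and hyperparameters are placeholders, and GPT2LMHeadModel is used here only to keep the loss computation simple; the same resize-then-train pattern applies to GPT2DoubleHeadsModel.

# sketch: fine-tune on a (toy) corpus containing the new token
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.add_tokens(["paradox"])

model = GPT2LMHeadModel.from_pretrained('gpt2')
model.resize_token_embeddings(len(tokenizer))

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# placeholder data; a real run would use a proper dataset containing the word
texts = ["This statement is a paradox.", "The paradox puzzled everyone."]

model.train()
for epoch in range(3):
    for text in texts:
        input_ids = tokenizer.encode(text, return_tensors="pt")
        outputs = model(input_ids, labels=input_ids)  # causal-LM loss over the sequence
        loss = outputs[0]
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()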

0 reactions
h56cho commented, Jun 4, 2020

Thank you! I was able to confirm that what you mentioned in your previous post also works for my case.

Read more comments on GitHub

Top Results From Across the Web

  • OpenAI GPT2 - Hugging Face
    GPT-2 is a large transformer-based language model with 1.5 billion parameters, ...
  • gpt2 - Hugging Face
    GPT-2 is a transformers model pretrained on a very large corpus of English data in a ... GPT2Model tokenizer = GPT2Tokenizer.from_pretrained('gpt2') model ...
  • OpenAI GPT2 — transformers 3.0.2 documentation
    Write With Transformer is a webapp created and hosted by Hugging Face showcasing the ... GPT-2 BPE tokenizer, using byte-level Byte-Pair-Encoding.
  • OpenAI GPT2 — transformers 3.5.0 documentation
    Construct a “fast” GPT-2 tokenizer (backed by HuggingFace’s tokenizers library). Based on byte-level Byte-Pair-Encoding. This tokenizer has been trained to ...
  • Pyodide GPT-2 Tokenizer - Hugging Face
    Python implementation of GPT-2 Tokenizer running inside your browser. Open your browser console to see Pyodide output.
