
Hugging Face GPT-2 Tokenizer

See original GitHub issue

Hello,

I know that if I add a new “special token” to the pre-made GPT-2 tokenizer, and I want to use the pre-trained GPT-2 model for my analysis, I will need to further train the pre-trained GPT-2 so that the model learns the new special token.

But what if I just add an extra non-special token? For example, the word “paradox” is not included in the existing GPT-2 tokenizer, so say I add “paradox” to the existing GPT-2 vocabulary, like below:

# load the pre-trained GPT-2 tokenizer and double-heads model
from transformers import GPT2Tokenizer, GPT2DoubleHeadsModel

gpt2_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# add a new word (not a special token) to the existing vocabulary;
# the pre-assigned special tokens are left unchanged
gpt2_tokenizer.add_tokens(["paradox"])

# get the pre-trained Hugging Face GPT2DoubleHeadsModel
model_gpt2DoubleHeadsModel = GPT2DoubleHeadsModel.from_pretrained('gpt2', output_hidden_states=True)

# resize the token embeddings to match the enlarged tokenizer
# (I am not sure exactly what this function does)
model_gpt2DoubleHeadsModel.resize_token_embeddings(len(gpt2_tokenizer))

Given that I didn’t make any changes to the special tokens in the GPT-2 tokenizer, do I still need to train the already pre-trained GPT2DoubleHeadsModel before I start using it, just because I added a new word to the vocabulary?

Thank you,
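
For context on the resize_token_embeddings call in the snippet above: it grows the model’s input (and tied output) token-embedding matrix to match the new tokenizer length, and any newly added rows start out randomly initialized. Here is a minimal sketch of how one could check this, assuming the standard transformers and PyTorch APIs; the sizes in the comments are what you would expect for the stock 'gpt2' checkpoint:

# sketch: inspect the embedding matrix before and after resizing
from transformers import GPT2Tokenizer, GPT2DoubleHeadsModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2DoubleHeadsModel.from_pretrained('gpt2')

print(len(tokenizer))                             # 50257 for the stock GPT-2 vocabulary
print(model.get_input_embeddings().weight.shape)  # torch.Size([50257, 768])

tokenizer.add_tokens(["paradox"])                 # returns 1 if "paradox" was not already a token
model.resize_token_embeddings(len(tokenizer))

print(len(tokenizer))                             # 50258
print(model.get_input_embeddings().weight.shape)  # torch.Size([50258, 768]); the new row is randomly initialized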

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
LysandreJik commented, Jun 3, 2020

You would need to fine-tune your GPT-2 model on a dataset containing the word, yes. The reason is that the model needs to learn in which contexts the word is used, what it means, and so on.
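
Building on this answer, here is a minimal, illustrative fine-tuning sketch (not from the thread): after adding the token and resizing the embeddings, continue ordinary language-model training on text that actually contains the new word, so that its freshly initialized embedding row receives gradients. The toy corpus and hyperparameters are placeholders, and GPT2LMHeadModel is used here only to keep the loss computation simple; the same resize-then-train pattern applies to GPT2DoubleHeadsModel.

# sketch: fine-tune on a (toy) corpus containing the new token
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.add_tokens(["paradox"])

model = GPT2LMHeadModel.from_pretrained('gpt2')
model.resize_token_embeddings(len(tokenizer))

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# placeholder data; a real run would use a proper dataset containing the word
texts = ["This statement is a paradox.", "The paradox puzzled everyone."]

model.train()
for epoch in range(3):
    for text in texts:
        input_ids = tokenizer.encode(text, return_tensors="pt")
        outputs = model(input_ids, labels=input_ids)  # causal-LM loss over the sequence
        loss = outputs[0]
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()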

0 reactions
h56cho commented, Jun 4, 2020

Thank you! I was able to confirm that what you mentioned in your previous post also works for my case.

Read more comments on GitHub

Top Results From Across the Web

  • OpenAI GPT2 - Hugging Face
    GPT-2 is a large transformer-based language model with 1.5 billion parameters, ...
  • gpt2 - Hugging Face
    GPT-2 is a transformers model pretrained on a very large corpus of English data in a ... GPT2Model tokenizer = GPT2Tokenizer.from_pretrained('gpt2') model ...
  • OpenAI GPT2 — transformers 3.0.2 documentation
    Write With Transformer is a webapp created and hosted by Hugging Face showcasing the ... GPT-2 BPE tokenizer, using byte-level Byte-Pair-Encoding.
  • OpenAI GPT2 — transformers 3.5.0 documentation
    Construct a “fast” GPT-2 tokenizer (backed by HuggingFace’s tokenizers library). Based on byte-level Byte-Pair-Encoding. This tokenizer has been trained to ...
  • Pyodide GPT-2 Tokenizer - Hugging Face
    Python implementation of GPT-2 Tokenizer running inside your browser. Open your browser console to see Pyodide output.
