Hugging Face GPT-2 Tokenizer
Hello,
I know that if I add any new “special token” to the pre-made GPT-2 tokenizer, and I want to use the pre-trained GPT-2 model for my analysis, I will need to re-train the pre-trained GPT-2 so that the model learns that new special token.
But what if I just add an extra non-special token? For example, suppose the word “paradox” is not included in the existing GPT-2 tokenizer, so I add the word “paradox” to the existing GPT-2 vocabulary, like below:
from transformers import GPT2Tokenizer, GPT2DoubleHeadsModel

# load the pre-trained GPT-2 tokenizer
gpt2_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# add a new word (not a special token) to the existing vocabulary;
# I am not making any changes to the pre-assigned special tokens
gpt2_tokenizer.add_tokens("paradox")

# load the pre-trained Hugging Face GPT2DoubleHeadsModel
model_gpt2DoubleHeadsModel = GPT2DoubleHeadsModel.from_pretrained('gpt2', output_hidden_states=True)

# resize the token embeddings to the new vocabulary size
# (not sure what this function does)
model_gpt2DoubleHeadsModel.resize_token_embeddings(len(gpt2_tokenizer))
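From what I can tell, resize_token_embeddings only grows (or shrinks) the model’s token-embedding matrix to match the new vocabulary size, with a freshly initialized (not pre-trained) row for each added token. A quick sketch that should show this, assuming “paradox” was not already a single token in the vocabulary and a recent transformers version:

from transformers import GPT2Tokenizer, GPT2DoubleHeadsModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2DoubleHeadsModel.from_pretrained('gpt2', output_hidden_states=True)

# embedding matrix before adding anything: 50257 rows for the 'gpt2' checkpoint
print(model.get_input_embeddings().weight.shape)   # torch.Size([50257, 768])

num_added = tokenizer.add_tokens("paradox")        # number of tokens actually added (0 if it already existed)
model.resize_token_embeddings(len(tokenizer))

# one extra, freshly initialized row per added token; the pre-trained rows are untouched
print(model.get_input_embeddings().weight.shape)   # torch.Size([50258, 768]) if one token was added
print(tokenizer.convert_tokens_to_ids("paradox"))  # id assigned to the new token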
Given that I didn’t make any changes to the special tokens in the GPT-2 tokenizer, do I still need to train the already pre-trained GPT2DoubleHeadsModel before I start using it, just because I added a new word to the vocabulary?
Thank you,
Issue Analytics
- Created: 3 years ago
- Comments: 6 (3 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
You would need to fine-tune your GPT-2 model on a dataset containing the word, yes. The reason being that your model needs to understand in which contexts the word is used, what it means, etc.
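A minimal sketch of what such fine-tuning could look like, using GPT2LMHeadModel for simplicity and a toy two-sentence corpus containing the new word (the texts, hyperparameters, and loop structure are placeholders, and the code assumes a recent transformers version):

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.add_tokens("paradox")

model = GPT2LMHeadModel.from_pretrained('gpt2')
model.resize_token_embeddings(len(tokenizer))
model.train()

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# toy corpus containing the new word -- replace with your own dataset
texts = [
    "The liar paradox is a classic problem in logic.",
    "Zeno's paradox puzzled philosophers for centuries.",
]

for epoch in range(3):
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt")
        # for causal language modelling the labels are the input ids themselves
        outputs = model(**inputs, labels=inputs["input_ids"])
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()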
Thank you! I was able to confirm that what you mentioned in your previous post also works for my case.