OPT vocab size of model and tokenizer does not match
System Info
- transformers version: 4.19.2
- Platform: Linux-5.4.0-72-generic-x86_64-with-glibc2.29
- Python version: 3.8.10
- Huggingface_hub version: 0.7.0
- PyTorch version (GPU?): 1.11.0+cu113 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: no
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained('facebook/opt-350m')
tok = AutoTokenizer.from_pretrained('facebook/opt-350m', use_fast=False)
print(model.config.vocab_size) # 50272
print(tok.vocab_size) # 50265
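The gap between the two numbers is exactly the problem described below: the model's output head can emit any ID in [0, model.config.vocab_size), while the tokenizer only knows IDs below its own size. A minimal sketch of a pre-decode guard, using toy stand-ins (the names and helper below are illustrative, not part of the transformers API):

```python
# Toy stand-ins for the numbers reported above (not the real OPT objects).
MODEL_VOCAB_SIZE = 50272   # model.config.vocab_size for facebook/opt-350m
TOKENIZER_SIZE = 50265     # tok.vocab_size reported above

def filter_decodable(ids, tokenizer_size=TOKENIZER_SIZE):
    """Drop generated IDs the tokenizer has no entry for."""
    return [i for i in ids if 0 <= i < tokenizer_size]

generated = [2, 100, 50264, 50271]   # 50271 falls in the padded range
print(filter_decodable(generated))   # [2, 100, 50264]
```

Filtering before calling decode() avoids the crash, at the cost of silently dropping the out-of-range IDs.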
Expected behavior
Hello,
I’m not sure whether this is a bug or if I am missing something.
In the reproduction script above, the model has a larger vocabulary than the tokenizer. In my project, the LM produces token ID 50272, which the tokenizer doesn't know, so its decode() function fails.
(I use my own text generation script, so is it possible that the model is simply not supposed to output the last seven token IDs that the tokenizer doesn't know?)
Best, David
Issue Analytics
- Created a year ago
- Reactions: 1
- Comments: 5 (3 by maintainers)
Top Results From Across the Web
- OPT - Hugging Face: vocab_size (int, optional, defaults to 50272) — Vocabulary size of the OPT model. Defines the number of different tokens that can...
- nlp - what is the difference between len(tokenizer) and ...: what is the difference between tokenizer.vocab_size and len(tokenizer)?
- Text Generation | Kaggle
- NLP | How to add a domain-specific vocabulary (new tokens ...): Resize the model embeddings matrix so that it matches the (new) tokenizer size, so that the token embedding vectors of the existing vocabulary are preserved...
- How to Build a Bert WordPiece Tokenizer in Python ... - YouTube: Building a transformer model from scratch can often be the only option for many more specific use cases.
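The "resize the embeddings to match the tokenizer" idea mentioned above is what transformers exposes as model.resize_token_embeddings(len(tokenizer)). A toy sketch of what such a resize does conceptually, with the embedding matrix as a plain list of vectors (the function below is illustrative, not the library's implementation):

```python
import random

def resize_embeddings(matrix, new_size, dim):
    """Grow or shrink a toy embedding matrix, preserving existing rows."""
    if new_size <= len(matrix):
        return matrix[:new_size]                      # truncate
    extra = [[random.gauss(0.0, 0.02) for _ in range(dim)]
             for _ in range(new_size - len(matrix))]  # init new rows
    return matrix + extra

emb = [[1.0, 1.0], [2.0, 2.0]]     # 2 tokens, dim 2
bigger = resize_embeddings(emb, 4, 2)
print(len(bigger))                 # 4
print(bigger[:2])                  # original rows preserved
```

Note that for this issue the resize would go the other way: the model's embedding matrix is already larger than the tokenizer, so the padded rows simply correspond to IDs the tokenizer never produces.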
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Duplicate of https://github.com/huggingface/transformers/issues/17431#issuecomment-1224231170
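Per the linked duplicate, the model's embedding matrix is simply padded beyond the tokenizer's vocabulary, so the extra IDs carry no meaning. For a custom generation loop like the reporter's, one common workaround is to mask the logits of the padded IDs so they can never be sampled (transformers' generate() offers bad_words_ids for a similar purpose). A minimal sketch with toy numbers, assuming a plain list of logits:

```python
import math

def mask_padded_logits(logits, tokenizer_size):
    """Set logits of IDs the tokenizer doesn't know to -inf."""
    return [x if i < tokenizer_size else -math.inf
            for i, x in enumerate(logits)]

def argmax(xs):
    return max(range(len(xs)), key=lambda i: xs[i])

logits = [0.1, 0.5, 0.2, 9.0]     # pretend ID 3 is in the padded range
masked = mask_padded_logits(logits, tokenizer_size=3)
print(argmax(logits))             # 3  (would crash decode)
print(argmax(masked))             # 1  (highest decodable ID)
```

With the mask applied, greedy or sampled decoding stays within the range the tokenizer can actually decode.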
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.