
OPT vocab size of model and tokenizer does not match

See original GitHub issue

System Info

  • transformers version: 4.19.2
  • Platform: Linux-5.4.0-72-generic-x86_64-with-glibc2.29
  • Python version: 3.8.10
  • Huggingface_hub version: 0.7.0
  • PyTorch version (GPU?): 1.11.0+cu113 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: <fill in>
  • Using distributed or parallel set-up in script?: no

Who can help?

@LysandreJik

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained('facebook/opt-350m')
tok = AutoTokenizer.from_pretrained('facebook/opt-350m', use_fast=False)

print(model.config.vocab_size)  # 50272
print(tok.vocab_size)           # 50265

Expected behavior

Hello, I’m not sure whether this is a bug or whether I am missing something. In the reproduction script above, the model has a larger vocabulary than the tokenizer. In my project, the LM produces token id 50272, which the tokenizer doesn’t know, so the decode() function fails. (I use my own text generation script; could it be that the model is simply not supposed to output the last 7 token ids that the tokenizer doesn’t cover?)
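For context: in transformers, tokenizer.vocab_size counts only the base vocabulary, while len(tokenizer) also counts added special tokens, and a model's embedding matrix is sometimes padded beyond even that. One workaround, assuming the extra ids carry no decodable text, is to substitute a placeholder for any unknown id before decoding. The sketch below uses a toy dict in place of the real OPT tokenizer; with the actual tokenizer you would compare ids against len(tok) instead:

```python
def safe_decode(ids, id_to_token, unk="<unk>"):
    """Map each id to its token, substituting `unk` for ids the vocabulary doesn't know."""
    return " ".join(id_to_token.get(i, unk) for i in ids)

# Toy id->token table standing in for the tokenizer's vocabulary.
vocab = {0: "hello", 1: "world", 2: "!"}

# id 7 plays the role of an id that exists only in the model's (padded)
# embedding matrix, not in the tokenizer's vocabulary.
print(safe_decode([0, 1, 7, 2], vocab))  # hello world <unk> !
```

Alternatively, masking the logits of the out-of-vocabulary ids to -inf before sampling prevents the model from ever emitting them.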

Best, David

Issue Analytics

  • State: closed
  • Created: a year ago
  • Reactions: 1
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

0 reactions
github-actions[bot] commented, Sep 25, 2022

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.


Top Results From Across the Web

OPT - Hugging Face
vocab_size (int, optional, defaults to 50272) — Vocabulary size of the OPT model. Defines the number of different tokens that can...

nlp - what is the difference between len(tokenizer) and ...
So now my question is: what is the difference between tokenizer.vocab_size and len(tokenizer)? nlp · tokenize · huggingface-transformers ...

Text Generation | Kaggle
The versions of TensorFlow you are currently using is 2.3.0 and is not ... /how-to-find-num-words-or-vocabulary-size-of-keras-tokenizer-when-one-is-not-as ...

NLP | How to add a domain-specific vocabulary (new tokens ...
Resize the model embeddings matrix so that it matches the (new) tokenizer size (so that the token embedding vectors of the existing vocabulary will...

How to Build a Bert WordPiece Tokenizer in Python ... - YouTube
Building a transformer model from scratch can often be the only option for many more specific use cases. Although BERT and other transformer ...
