
OPT vocab size of model and tokenizer does not match

See original GitHub issue

System Info

  • transformers version: 4.19.2
  • Platform: Linux-5.4.0-72-generic-x86_64-with-glibc2.29
  • Python version: 3.8.10
  • Huggingface_hub version: 0.7.0
  • PyTorch version (GPU?): 1.11.0+cu113 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: <fill in>
  • Using distributed or parallel set-up in script?: no

Who can help?

@LysandreJik

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained('facebook/opt-350m')
tok = AutoTokenizer.from_pretrained('facebook/opt-350m', use_fast=False)

print(model.config.vocab_size)  # 50272
print(tok.vocab_size)           # 50265

Expected behavior

Hello, I’m not sure whether this is a bug or whether I am missing something. In the reproduction script above, the model has a larger vocabulary than the tokenizer. In my project, the LM produces token id 50272, which the tokenizer doesn’t know, so the decode() function fails. (I use my own text generation script; could it be that the model is simply not supposed to output the last 7 token ids that the tokenizer doesn’t cover?)
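For context: in transformers, tokenizer.vocab_size counts only the base vocabulary, while len(tokenizer) also counts added special tokens, and a model's embedding matrix is sometimes padded beyond even that. One workaround, assuming the extra ids carry no decodable text, is to substitute a placeholder for any unknown id before decoding. The sketch below uses a toy dict in place of the real OPT tokenizer; with the actual tokenizer you would compare ids against len(tok) instead:

```python
def safe_decode(ids, id_to_token, unk="<unk>"):
    """Map each id to its token, substituting `unk` for ids the vocabulary doesn't know."""
    return " ".join(id_to_token.get(i, unk) for i in ids)

# Toy id->token table standing in for the tokenizer's vocabulary.
vocab = {0: "hello", 1: "world", 2: "!"}

# id 7 plays the role of an id that exists only in the model's (padded)
# embedding matrix, not in the tokenizer's vocabulary.
print(safe_decode([0, 1, 7, 2], vocab))  # hello world <unk> !
```

Alternatively, masking the logits of the out-of-vocabulary ids to -inf before sampling prevents the model from ever emitting them.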

Best, David

Issue Analytics

  • State: closed
  • Created: a year ago
  • Reactions: 1
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

0 reactions
github-actions[bot] commented, Sep 25, 2022

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.


Top Results From Across the Web

OPT - Hugging Face
vocab_size (int, optional, defaults to 50272) — Vocabulary size of the OPT model. Defines the number of different tokens that can...

nlp - what is the difference between len(tokenizer) and ...
So now my question is: what is the difference between tokenizer.vocab_size and len(tokenizer)? nlp · tokenize · huggingface-transformers ...

Text Generation | Kaggle
The versions of TensorFlow you are currently using is 2.3.0 and is not ... /how-to-find-num-words-or-vocabulary-size-of-keras-tokenizer-when-one-is-not-as ...

NLP | How to add a domain-specific vocabulary (new tokens ...
Resize the model embeddings matrix so that it matches the (new) tokenizer size (so that the token embedding vectors of the existing vocabulary will...

How to Build a Bert WordPiece Tokenizer in Python ... - YouTube
Building a transformer model from scratch can often be the only option for many more specific use cases. Although BERT and other transformer ...
