
Different vocab_size between model and tokenizer of mT5

See original GitHub issue

Environment info

  • transformers version: 4.1.1
  • Platform: ubuntu 18.04
  • Python version: 3.8.5
  • PyTorch version (GPU?): 1.7.1

Who can help

@patrickvonplaten

To reproduce

Steps to reproduce the behavior:

from transformers import AutoModelForSeq2SeqLM
from transformers import AutoTokenizer

mt5s = ['google/mt5-base', 'google/mt5-small', 'google/mt5-large', 'google/mt5-xl', 'google/mt5-xxl']

for mt5 in mt5s:
    model = AutoModelForSeq2SeqLM.from_pretrained(mt5)
    tokenizer = AutoTokenizer.from_pretrained(mt5)

    print()
    print(mt5)
    print(f"tokenizer vocab: {tokenizer.vocab_size}, model vocab: {model.config.vocab_size}")

This is problematic when one adds (special) tokens to the tokenizer and then resizes the model's token embeddings with model.resize_token_embeddings(len(tokenizer)), as in the sketch below.
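
For illustration, a minimal sketch of that workflow, assuming a single custom special token (the token name <new_token> is just a placeholder) and using the current spelling of the method, resize_token_embeddings:

from transformers import AutoModelForSeq2SeqLM
from transformers import AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained('google/mt5-small')
tokenizer = AutoTokenizer.from_pretrained('google/mt5-small')

# Add a custom special token (placeholder name, for illustration only).
tokenizer.add_special_tokens({'additional_special_tokens': ['<new_token>']})

# The pattern recommended in the docs. For mT5 this can actually shrink the
# embedding matrix, because len(tokenizer) is smaller than the model's
# (padded) config.vocab_size.
model.resize_token_embeddings(len(tokenizer))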

Expected behavior

Shouldn't vocab_size be the same for the model and the tokenizer?

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

2 reactions
LysandreJik commented, Mar 5, 2021

Hello! This is a duplicate of https://github.com/huggingface/transformers/issues/4875, https://github.com/huggingface/transformers/issues/10144 and https://github.com/huggingface/transformers/issues/9247

@patrickvonplaten, maybe we could do something about this in the docs? In the docs we recommend doing this:

model.resize_token_embedding(len(tokenizer))

but this is unfortunately false for T5!

0 reactions
takiholadi commented, Aug 10, 2021

Hello! This is a duplicate of #4875, #10144 and #9247

@patrickvonplaten, maybe we could do something about this in the docs? In the docs we recommend doing this:

model.resize_token_embedding(len(tokenizer))

but this is unfortunately false for T5!

What is the correct way to resize_token_embedding for T5/mT5?
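
One workaround that is commonly suggested in the duplicate issues (not an official answer from this thread) is to only ever grow the embedding matrix, since the T5/mT5 checkpoints reportedly pad config.vocab_size beyond the tokenizer's vocabulary. A sketch under that assumption, again with a placeholder token name:

from transformers import AutoModelForSeq2SeqLM
from transformers import AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained('google/mt5-small')
tokenizer = AutoTokenizer.from_pretrained('google/mt5-small')

tokenizer.add_special_tokens({'additional_special_tokens': ['<new_token>']})

# Never resize below the checkpoint's original vocab_size, otherwise the
# spare embedding rows already present in the checkpoint get cut off.
new_size = max(len(tokenizer), model.config.vocab_size)
model.resize_token_embeddings(new_size)

Newly added tokens get ids starting at len(tokenizer), which for mT5 is still below the padded vocab_size, so they typically land on spare rows that already exist in the embedding matrix and no resize happens at all.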

