Different vocab_size between model and tokenizer of mT5
Environment info
- transformers version: 4.1.1
- Platform: Ubuntu 18.04
- Python version: 3.8.5
- PyTorch version (GPU?): 1.7.1
Who can help
To reproduce
Steps to reproduce the behavior:
from transformers import AutoModelForSeq2SeqLM
from transformers import AutoTokenizer
mt5s = ['google/mt5-base', 'google/mt5-small', 'google/mt5-large', 'google/mt5-xl', 'google/mt5-xxl']
for mt5 in mt5s:
    model = AutoModelForSeq2SeqLM.from_pretrained(mt5)
    tokenizer = AutoTokenizer.from_pretrained(mt5)
    print()
    print(mt5)
    print(f"tokenizer vocab: {tokenizer.vocab_size}, model vocab: {model.config.vocab_size}")
This is problematic when one adds some (special) tokens to the tokenizer and then resizes the model's token embeddings with model.resize_token_embeddings(len(tokenizer)), as sketched below.
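For illustration, a minimal sketch of how that recipe can go wrong (the token name <extra_sep> is a made-up example):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('google/mt5-small')
model = AutoModelForSeq2SeqLM.from_pretrained('google/mt5-small')

# The embedding matrix starts out with model.config.vocab_size rows,
# which is larger than the tokenizer's vocabulary.
print(model.get_input_embeddings().weight.shape[0])

# Add one custom token and follow the usual recipe.
tokenizer.add_special_tokens({'additional_special_tokens': ['<extra_sep>']})
model.resize_token_embeddings(len(tokenizer))

# Because len(tokenizer) was smaller than the model vocab to begin with,
# this call can actually shrink the embedding matrix instead of growing it.
print(model.get_input_embeddings().weight.shape[0])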
Expected behavior
The vocab_size of the model and the tokenizer should be the same, shouldn't it?
Issue Analytics
- State:
- Created 3 years ago
- Comments: 5 (2 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Hello! This is a duplicate of https://github.com/huggingface/transformers/issues/4875, https://github.com/huggingface/transformers/issues/10144 and https://github.com/huggingface/transformers/issues/9247
@patrickvonplaten, maybe we could do something about this in the docs? In the docs we recommend doing this:
but this is unfortunately false for T5!
What is the correct way to call resize_token_embeddings for T5/mT5?
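One possible workaround (just a sketch, not an official recommendation) is to make sure the resize never shrinks the embedding matrix:

# Only grow the embeddings: if len(tokenizer) is still smaller than the
# model's padded vocab, the ids of the newly added tokens already fit in
# the existing matrix and no resize is needed at all.
new_size = max(len(tokenizer), model.config.vocab_size)
model.resize_token_embeddings(new_size)

This keeps the spare padded rows of the checkpoint intact; the matrix only actually grows once more tokens have been added than there are spare rows.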