Different vocab_size between model and tokenizer of mT5
Environment info
- transformers version: 4.1.1
- Platform: Ubuntu 18.04
- Python version: 3.8.5
- PyTorch version (GPU?): 1.7.1
Who can help
To reproduce
Steps to reproduce the behavior:
from transformers import AutoModelForSeq2SeqLM
from transformers import AutoTokenizer
mt5s = ['google/mt5-base', 'google/mt5-small', 'google/mt5-large', 'google/mt5-xl', 'google/mt5-xxl']
for mt5 in mt5s:
    model = AutoModelForSeq2SeqLM.from_pretrained(mt5)
    tokenizer = AutoTokenizer.from_pretrained(mt5)
    print()
    print(mt5)
    print(f"tokenizer vocab: {tokenizer.vocab_size}, model vocab: {model.config.vocab_size}")
This is problematic when one adds some (special) tokens to the tokenizer and then resizes the model's token embeddings with model.resize_token_embeddings(len(tokenizer)), as sketched below.
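For illustration, a minimal sketch of how that recipe can go wrong (the token name <extra_sep> is a made-up example):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('google/mt5-small')
model = AutoModelForSeq2SeqLM.from_pretrained('google/mt5-small')

# The embedding matrix starts out with model.config.vocab_size rows,
# which is larger than the tokenizer's vocabulary.
print(model.get_input_embeddings().weight.shape[0])

# Add one custom token and follow the usual recipe.
tokenizer.add_special_tokens({'additional_special_tokens': ['<extra_sep>']})
model.resize_token_embeddings(len(tokenizer))

# Because len(tokenizer) was smaller than the model vocab to begin with,
# this call can actually shrink the embedding matrix instead of growing it.
print(model.get_input_embeddings().weight.shape[0])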
Expected behavior
The vocab_size of the model and the tokenizer should be the same, shouldn't it?
Issue Analytics
- State:
- Created 3 years ago
- Comments: 5 (2 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Hello! This is a duplicate of https://github.com/huggingface/transformers/issues/4875, https://github.com/huggingface/transformers/issues/10144 and https://github.com/huggingface/transformers/issues/9247
@patrickvonplaten, maybe we could do something about this in the docs? In the docs we recommend doing this:
but this is unfortunately false for T5!
What is the correct way to call resize_token_embeddings for T5/mT5?
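One possible workaround (just a sketch, not an official recommendation) is to make sure the resize never shrinks the embedding matrix:

# Only grow the embeddings: if len(tokenizer) is still smaller than the
# model's padded vocab, the ids of the newly added tokens already fit in
# the existing matrix and no resize is needed at all.
new_size = max(len(tokenizer), model.config.vocab_size)
model.resize_token_embeddings(new_size)

This keeps the spare padded rows of the checkpoint intact; the matrix only actually grows once more tokens have been added than there are spare rows.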