Inconsistent vocab size between pretrained T5Tokenizer and T5ForConditionalGeneration
❓ Questions & Help
The pretrained T5Tokenizer has a vocab size of 32100 (32000 tokens plus 100 extra_ids), but the shared embedding layer of T5ForConditionalGeneration has shape (32128, 768). I checked the google-research implementation of T5 and found that it also uses a vocab size of 32100.
Where do the extra 28 embeddings come from, and how can we map them to the tokenizer?
To reproduce
from transformers import (
    T5Tokenizer,
    T5ForConditionalGeneration,
)

# Load the pretrained t5-base tokenizer and model.
tokenizer_pretrained = T5Tokenizer.from_pretrained('t5-base')
model_pretrained = T5ForConditionalGeneration.from_pretrained('t5-base')

# Compare the tokenizer's vocab size with the shape of the shared embedding.
len(tokenizer_pretrained.get_vocab()), model_pretrained.state_dict()['shared.weight'].shape
Output:
(32100, torch.Size([32128, 768]))
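As a quick sanity check (a minimal sketch reusing the tokenizer_pretrained and model_pretrained objects loaded above), you can confirm that every id the tokenizer produces, including the 100 <extra_id_*> sentinel tokens, stays below 32100, so the last 28 embedding rows are never indexed:

# All tokenizer ids, including the <extra_id_*> sentinels, are < 32100,
# so embedding rows 32100-32127 are never looked up.
vocab = tokenizer_pretrained.get_vocab()
sentinel_ids = tokenizer_pretrained.convert_tokens_to_ids(
    [f'<extra_id_{i}>' for i in range(100)]
)
print(max(vocab.values()))                    # 32099
print(min(sentinel_ids), max(sentinel_ids))   # both below 32100
assert max(vocab.values()) < model_pretrained.state_dict()['shared.weight'].shape[0]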
Issue Analytics
- Created: 3 years ago
- Reactions: 4
- Comments: 9 (3 by maintainers)
Top GitHub Comments
Hey @cstorm125,

I think those 28 leftover embeddings are simply not used. As far as I know, the embedding matrix has 32128 rows simply because 32128 is a more GPU-friendly number than 32100: 32128 = 251 * 128 is divisible by 128, whereas 32100 = 4 * 8025 is only divisible by 4. GPUs tend to be more efficient when matrix dimensions are padded to a multiple of a large power of two.

Also see: https://www.quora.com/Why-should-I-choose-a-mini-batch-size-of-32-64-128-256-etc-i-e-a-power-of-two-and-not-a-size-of-50-100-500-1000-Is-there-any-benefit-of-choosing-power-of-two-mini-batch-sizes
Temporary solution:
model.resize_token_embeddings(len(tokenizer))
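For example, with the tokenizer and model loaded in the snippet above, resizing shrinks the shared embedding to match the tokenizer (a sketch; note that this modifies the model's weights in place):

# Trim the embedding matrix so its first dimension equals the tokenizer's vocab size.
model_pretrained.resize_token_embeddings(len(tokenizer_pretrained))
print(model_pretrained.get_input_embeddings().weight.shape)  # torch.Size([32100, 768])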