Inconsistent vocab size between pretrained T5Tokenizer and T5ForConditionalGeneration
❓ Questions & Help
The pretrained T5Tokenizer has a vocab size of 32100 (32000 tokens plus 100 extra_ids), but the shared embedding layer of T5ForConditionalGeneration has shape (32128, 768). I checked the google-research implementation of T5 and found that it also uses a vocab size of 32100.
Where do the extra 28 embeddings come from, and how can we map them to the tokenizer?
To reproduce
from transformers import (
    T5Tokenizer,
    T5ForConditionalGeneration,
)

# Load the pretrained t5-base tokenizer and model.
tokenizer_pretrained = T5Tokenizer.from_pretrained('t5-base')
model_pretrained = T5ForConditionalGeneration.from_pretrained('t5-base')

# Compare the tokenizer's vocab size with the shape of the shared embedding.
len(tokenizer_pretrained.get_vocab()), model_pretrained.state_dict()['shared.weight'].shape
Output:
(32100, torch.Size([32128, 768]))
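As a quick sanity check (a minimal sketch reusing the tokenizer_pretrained and model_pretrained objects loaded above), you can confirm that every id the tokenizer produces, including the 100 <extra_id_*> sentinel tokens, stays below 32100, so the last 28 embedding rows are never indexed:

# All tokenizer ids, including the <extra_id_*> sentinels, are < 32100,
# so embedding rows 32100-32127 are never looked up.
vocab = tokenizer_pretrained.get_vocab()
sentinel_ids = tokenizer_pretrained.convert_tokens_to_ids(
    [f'<extra_id_{i}>' for i in range(100)]
)
print(max(vocab.values()))                    # 32099
print(min(sentinel_ids), max(sentinel_ids))   # both below 32100
assert max(vocab.values()) < model_pretrained.state_dict()['shared.weight'].shape[0]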
Issue Analytics
- Created: 3 years ago
- Reactions: 4
- Comments: 9 (3 by maintainers)
Top GitHub Comments
Hey @cstorm125,

I think those 28 leftover embeddings are simply not used. As far as I know, the embedding matrix has 32128 rows simply because 32128 is a more GPU-friendly number than 32100: 32128 = 251 * 128 is divisible by 128, whereas 32100 = 4 * 8025 is only divisible by 4. GPUs tend to be more efficient when matrix dimensions are padded to a multiple of a large power of two.

Also see: https://www.quora.com/Why-should-I-choose-a-mini-batch-size-of-32-64-128-256-etc-i-e-a-power-of-two-and-not-a-size-of-50-100-500-1000-Is-there-any-benefit-of-choosing-power-of-two-mini-batch-sizes
Temporary solution:
model.resize_token_embeddings(len(tokenizer))
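For example, with the tokenizer and model loaded in the snippet above, resizing shrinks the shared embedding to match the tokenizer (a sketch; note that this modifies the model's weights in place):

# Trim the embedding matrix so its first dimension equals the tokenizer's vocab size.
model_pretrained.resize_token_embeddings(len(tokenizer_pretrained))
print(model_pretrained.get_input_embeddings().weight.shape)  # torch.Size([32100, 768])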