Inconsistent number of vocab from pretrained T5Tokenizer and T5ForConditionalGeneration

See original GitHub issue

❓ Questions & Help

The pretrained T5Tokenizer has a vocab size of 32100 (32000 tokens plus 100 extra_ids), but the shared embedding layer of T5ForConditionalGeneration has shape (32128, 768). I checked the google-research implementation of T5 and found that it also uses a vocab size of 32100.

Where did the extra 28 embeddings come from, and how can we map them to the tokenizer?

To reproduce

from transformers import (
    T5Tokenizer, 
    T5ForConditionalGeneration,
)

tokenizer_pretrained = T5Tokenizer.from_pretrained('t5-base')
model_pretrained = T5ForConditionalGeneration.from_pretrained('t5-base')
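# Compare the tokenizer's vocabulary size with the shape of the model's shared embedding matrix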
len(tokenizer_pretrained.get_vocab()), model_pretrained.state_dict()['shared.weight'].shape

Output:

(32100, torch.Size([32128, 768]))

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 4
  • Comments: 9 (3 by maintainers)

Top GitHub Comments

12 reactions
patrickvonplaten commented, Jun 22, 2020

Hey @cstorm125,

I think those 28 leftover embeddings are simply not used. As far as I know, the embedding matrix has length 32128 simply because it is a more GPU-friendly number than 32100: 32128 = 128 * 251, whereas 32100 = 4 * 8025. That means the GPU is probably more efficient when it can work with dimensions that are multiples of a large power of two (here, 128).

Also see: https://www.quora.com/Why-should-I-choose-a-mini-batch-size-of-32-64-128-256-etc-i-e-a-power-of-two-and-not-a-size-of-50-100-500-1000-Is-there-any-benefit-of-choosing-power-of-two-mini-batch-sizes
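
As a quick sanity check (not from the original thread, just a sketch assuming the same t5-base checkpoint as in the reproduction above), you can confirm that the tokenizer never produces an ID that reaches the padded rows:

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('t5-base')
model = T5ForConditionalGeneration.from_pretrained('t5-base')

# Highest ID the tokenizer can emit (the 100 extra_ids included)
max_token_id = max(tokenizer.get_vocab().values())
# Number of rows in the shared input embedding matrix
embedding_rows = model.get_input_embeddings().weight.shape[0]

print(max_token_id, embedding_rows)   # 32099 32128
assert max_token_id < embedding_rows  # rows 32100..32127 are never indexed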

3 reactions
Darshan2104 commented, Jan 31, 2022

Temporary solution: model.resize_token_embeddings(len(tokenizer))
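
If you want the embedding matrix to match the tokenizer exactly (for instance before adding new tokens), here is a minimal sketch of that workaround, assuming the same t5-base checkpoint as above:

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('t5-base')
model = T5ForConditionalGeneration.from_pretrained('t5-base')

# Drop the 28 unused padding rows so the embedding matches the tokenizer's 32100 tokens
model.resize_token_embeddings(len(tokenizer))
print(model.get_input_embeddings().weight.shape)  # torch.Size([32100, 768])

Only the trailing rows are removed, so the embeddings of the real tokens are left untouched.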

