Discrepancy between the Hugging Face T5Tokenizer and the original T5 tokenizer
Environment info
- Transformers version: 4.3.2
- Platform: Linux
- Python version: 3.7
- PyTorch version (GPU?): yes
- Tensorflow version (GPU?): -
- Using GPU in script?: no
- Using distributed or parallel set-up in script?: no
Who can help
- tokenizers: @n1t0, @LysandreJik
Information
Model I am using: T5Tokenizer. I adapted the code of run_mlm.py [1] to use it with the T5 tokenizer. When I run the code, I get:

ValueError: This tokenizer does not have a mask token which is necessary for masked language modeling. You should pass `mlm=False` to train on causal language modeling instead.

I checked the error, and it occurs because tokenizer.mask_token is None for T5Tokenizer. According to the T5 paper, they use masked language modeling with their seq2seq objective as the pre-training objective, so they must have trained a mask token. Could you give me some insight into why a mask token does not exist in the Hugging Face implementation of T5Tokenizer, and how I can correct this to be able to run the run_mlm script? Thank you.
[1] https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_mlm.py
To reproduce
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
print(tokenizer.mask_token)  # => this is None
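The same ValueError can be reproduced without the full script. A minimal sketch, assuming a transformers version where DataCollatorForLanguageModeling validates the mask token on construction:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("t5-small")

# With mlm=True, the collator needs tokenizer.mask_token to build masked inputs,
# so for T5 this raises the ValueError quoted above:
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True)
```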
Expected behavior
The masked token as per T5 paper should exist in T5Tokenizer.
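Note that while mask_token is None, the tokenizer is not missing mask tokens entirely: T5Tokenizer registers 100 sentinel tokens as additional special tokens. A quick check (the exact ordering of the printed list is an assumption):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")

print(tokenizer.mask_token)                      # None: no single BERT-style [MASK] token
print(len(tokenizer.additional_special_tokens))  # 100 sentinel tokens by default
print(tokenizer.additional_special_tokens[:3])   # e.g. ['<extra_id_0>', '<extra_id_1>', '<extra_id_2>']
```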
Top GitHub Comments
Hi,

T5 is an encoder-decoder Transformer. The run_mlm.py script can only be used for encoder-only models, such as BERT, RoBERTa, DeBERTa, etc. Besides this, T5 does not use the regular [MASK] token as BERT does. Rather than masked language modeling, T5 is pre-trained on “unsupervised denoising training”. This is explained here.
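To illustrate the denoising objective described above: T5 replaces corrupted spans in the input with sentinel tokens and learns to reconstruct the dropped-out spans in the target. A minimal sketch, adapted from the example in the Hugging Face T5 documentation:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Input: each corrupted span is replaced by a unique sentinel token.
input_ids = tokenizer(
    "The <extra_id_0> walks in <extra_id_1> park", return_tensors="pt"
).input_ids

# Target: the dropped spans, each preceded by its sentinel token.
labels = tokenizer(
    "<extra_id_0> cute dog <extra_id_1> the <extra_id_2>", return_tensors="pt"
).input_ids

loss = model(input_ids=input_ids, labels=labels).loss
print(loss.item())
```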
Hi, did you successfully run the Hugging Face T5 pretraining? Can you give me some advice?