Bug with Training T5 Tokenizers on New Data
See original GitHub issue

System Info
- `transformers` version: 4.22.2
- Platform: Linux-5.10.135-122.509.amzn2.x86_64-x86_64-with-glibc2.17
- Python version: 3.8.13
- Huggingface_hub version: 0.10.0
- PyTorch version (GPU?): 1.12.1 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
```python
from transformers import AutoTokenizer

# Train T5's tokenizer on some new data.
training_corpus = ["12rdpo2rkfp", "$##@sdfag", "ja23m d@#"]
tokenizer = AutoTokenizer.from_pretrained("t5-large")
new_tokenizer = tokenizer.train_new_from_iterator(training_corpus, vocab_size=110)

# Print the vocabulary sequentially.
for i in range(110):
    print(new_tokenizer.convert_ids_to_tokens([i])[0])

# You'll see sentinel tokens such as <extra_id_1> are NOT at the end of the vocabulary.
```
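To make the failure concrete, here is a quick follow-up check — a minimal sketch that assumes the reproduction above has already run and that the retrained tokenizer still carries T5's default 100 `<extra_id_*>` sentinel tokens:

```python
# Check where the sentinel tokens actually landed. Assumes `new_tokenizer`
# from the reproduction above and T5's default 100 <extra_id_*> tokens.
vocab_size = len(new_tokenizer)
sentinel_ids = [
    new_tokenizer.convert_tokens_to_ids(f"<extra_id_{i}>") for i in range(100)
]

# For the original t5-large tokenizer these IDs fill the last 100 slots of
# the vocabulary; for the retrained tokenizer they may not.
print(sorted(sentinel_ids))
print("all at the end:", all(i >= vocab_size - 100 for i in sentinel_ids))
```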
Expected behavior
The sentinel tokens in T5 must be at the end of the vocabulary. This constraint is stated in the documentation (e.g., here), and official examples rely on it. The code below finds sentinel tokens by counting back from the end of the vocabulary (`len(self.tokenizer) - sentinel_ids`).
https://github.com/huggingface/transformers/blob/5cd16f01db3b5499d4665e8624801ed30ba87bdd/examples/flax/language-modeling/run_t5_mlm_flax.py#L378
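To illustrate why the ordering matters, here is a stripped-down sketch of the pattern that example relies on. The helper below is hypothetical (it is not the exact code in `run_t5_mlm_flax.py`): sentinel index `n` is mapped to ID `len(tokenizer) - n`, which is only correct when `<extra_id_{n-1}>` really is the `n`-th token from the end of the vocabulary.

```python
from transformers import AutoTokenizer

# Hypothetical helper illustrating the end-of-vocabulary assumption.
def sentinel_token_id(tokenizer, n):
    return len(tokenizer) - n

tokenizer = AutoTokenizer.from_pretrained("t5-large")

# For the stock t5-large tokenizer both lines should print the same ID;
# for a tokenizer produced by train_new_from_iterator they may not.
print(sentinel_token_id(tokenizer, 1))
print(tokenizer.convert_tokens_to_ids("<extra_id_0>"))
```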
However, when I follow the Hugging Face Course to train T5's tokenizer on new data, the new tokenizer does not conform to this constraint.
Top GitHub Comments
As long as it’s done in examples that are focused on T5-only (e.g. not the generic `run_summarization`), no problem with me!

@patrickvonplaten @SaulLu I can pick this issue. What should be the approach we need to take?
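The fix being discussed amounts to not assuming the sentinel tokens sit at the end of the vocabulary in T5-specific examples. A minimal sketch of a position-independent lookup is shown below; the helper name `get_sentinel_token_ids` is hypothetical and not an existing transformers API:

```python
from transformers import AutoTokenizer

def get_sentinel_token_ids(tokenizer):
    """Look up <extra_id_*> IDs by token string instead of assuming they
    occupy the last slots of the vocabulary. Hypothetical helper."""
    sentinel_tokens = [
        tok for tok in tokenizer.additional_special_tokens
        if tok.startswith("<extra_id_")
    ]
    return tokenizer.convert_tokens_to_ids(sentinel_tokens)

tokenizer = AutoTokenizer.from_pretrained("t5-large")
print(get_sentinel_token_ids(tokenizer)[:5])
```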