
Bug with Training T5 Tokenizers on New Data

See original GitHub issue

System Info

  • transformers version: 4.22.2
  • Platform: Linux-5.10.135-122.509.amzn2.x86_64-x86_64-with-glibc2.17
  • Python version: 3.8.13
  • Huggingface_hub version: 0.10.0
  • PyTorch version (GPU?): 1.12.1 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help?

@patrickvonplaten @SaulLu

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

from transformers import AutoTokenizer

# Train T5's tokenizer on some new data.
training_corpus = ["12rdpo2rkfp", "$##@sdfag", "ja23m d@#"]
tokenizer = AutoTokenizer.from_pretrained("t5-large")
new_tokenizer = tokenizer.train_new_from_iterator(training_corpus, vocab_size=110)

# Print the vocabulary sequentially.
for i in range(110):
    print(new_tokenizer.convert_ids_to_tokens([i])[0])

# You'll see that sentinel tokens such as `<extra_id_1>` are NOT at the end of the vocabulary.
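
# An extra check (my own sketch, not part of the original report): compare the
# sentinel token ids against the last ids of the new vocabulary. This assumes
# `train_new_from_iterator` carries over the original tokenizer's
# `additional_special_tokens` (the `<extra_id_N>` tokens).
sentinel_ids = sorted(
    new_tokenizer.convert_tokens_to_ids(token)
    for token in new_tokenizer.additional_special_tokens
)
tail_ids = list(range(len(new_tokenizer) - len(sentinel_ids), len(new_tokenizer)))
print("Sentinel ids:", sentinel_ids)
print("Last ids in the vocabulary:", tail_ids)
print("Sentinels at the end of the vocabulary?", sentinel_ids == tail_ids)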

Expected behavior

The sentinel tokens in T5 must be at the end of the vocabulary. This constraint is stated in the documentation (e.g., here), and the official examples rely on it: the code linked below finds sentinel token ids by counting back from the end of the vocabulary (`len(self.tokenizer) - sentinel_ids`). https://github.com/huggingface/transformers/blob/5cd16f01db3b5499d4665e8624801ed30ba87bdd/examples/flax/language-modeling/run_t5_mlm_flax.py#L378

However, when I follow the Hugging Face Course to train T5's tokenizer on new data, the new tokenizer does not conform to this constraint.
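
For reference, here is a small sketch of my own (not taken from the issue) of the positional assumption the linked example relies on: with the stock t5-large tokenizer, `<extra_id_k>` sits at id `len(tokenizer) - 1 - k`, so counting back from the end of the vocabulary works; with a retrained tokenizer it generally does not.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("t5-large")
for k in (0, 1, 99):
    token = f"<extra_id_{k}>"
    actual_id = tok.convert_tokens_to_ids(token)
    assumed_id = len(tok) - 1 - k  # the "count back from the end" assumption
    print(token, actual_id, assumed_id, actual_id == assumed_id)
# Running the same loop against a tokenizer produced by train_new_from_iterator
# shows the assumption no longer holds, so masking code based on it would pick
# the wrong ids.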

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 11 (10 by maintainers)

Top GitHub Comments

3 reactions
sgugger commented, Nov 21, 2022

As long as it’s done in examples that are focused on T5-only (e.g. not the generic run_summarization), no problem with me!

1 reaction
raghavanone commented, Oct 12, 2022

@patrickvonplaten @SaulLu I can pick up this issue. What approach should we take?
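
One possible direction (a sketch of my own, not a fix confirmed in this thread): have the T5-focused examples look the sentinel ids up by name rather than assume they occupy the last positions in the vocabulary. `sentinel_token_ids` below is a hypothetical helper, not an existing transformers function.

def sentinel_token_ids(tokenizer, num_sentinels=100):
    # Return the ids of <extra_id_0> .. <extra_id_{num_sentinels - 1}>, wherever
    # they happen to live in the vocabulary.
    return [
        tokenizer.convert_tokens_to_ids(f"<extra_id_{i}>")
        for i in range(num_sentinels)
    ]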
