Bug with Training T5 Tokenizers on New Data
See original GitHub issue

System Info
- `transformers` version: 4.22.2
- Platform: Linux-5.10.135-122.509.amzn2.x86_64-x86_64-with-glibc2.17
- Python version: 3.8.13
- Huggingface_hub version: 0.10.0
- PyTorch version (GPU?): 1.12.1 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
```python
from transformers import AutoTokenizer

# Train T5's tokenizer on some new data.
training_corpus = ["12rdpo2rkfp", "$##@sdfag", "ja23m d@#"]
tokenizer = AutoTokenizer.from_pretrained("t5-large")
new_tokenizer = tokenizer.train_new_from_iterator(training_corpus, vocab_size=110)

# Print the vocabulary sequentially.
for i in range(110):
    print(new_tokenizer.convert_ids_to_tokens([i])[0])

# You'll see sentinel tokens such as <extra_id_1> are NOT at the end of the vocabulary.
```
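To make the failure concrete, here is a quick follow-up check — a minimal sketch that assumes the reproduction above has already run and that the retrained tokenizer still carries T5's default 100 `<extra_id_*>` sentinel tokens:

```python
# Check where the sentinel tokens actually landed. Assumes `new_tokenizer`
# from the reproduction above and T5's default 100 <extra_id_*> tokens.
vocab_size = len(new_tokenizer)
sentinel_ids = [
    new_tokenizer.convert_tokens_to_ids(f"<extra_id_{i}>") for i in range(100)
]

# For the original t5-large tokenizer these IDs fill the last 100 slots of
# the vocabulary; for the retrained tokenizer they may not.
print(sorted(sentinel_ids))
print("all at the end:", all(i >= vocab_size - 100 for i in sentinel_ids))
```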
Expected behavior
The sentinel tokens in T5 must be at the end of the vocabulary. This constraint is stated in the documentation (e.g., here), and official examples rely on it. The code below finds sentinel tokens by counting back from the end of the vocabulary (`len(self.tokenizer) - sentinel_ids`).
https://github.com/huggingface/transformers/blob/5cd16f01db3b5499d4665e8624801ed30ba87bdd/examples/flax/language-modeling/run_t5_mlm_flax.py#L378
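To illustrate why the ordering matters, here is a stripped-down sketch of the pattern that example relies on. The helper below is hypothetical (it is not the exact code in `run_t5_mlm_flax.py`): sentinel index `n` is mapped to ID `len(tokenizer) - n`, which is only correct when `<extra_id_{n-1}>` really is the `n`-th token from the end of the vocabulary.

```python
from transformers import AutoTokenizer

# Hypothetical helper illustrating the end-of-vocabulary assumption.
def sentinel_token_id(tokenizer, n):
    return len(tokenizer) - n

tokenizer = AutoTokenizer.from_pretrained("t5-large")

# For the stock t5-large tokenizer both lines should print the same ID;
# for a tokenizer produced by train_new_from_iterator they may not.
print(sentinel_token_id(tokenizer, 1))
print(tokenizer.convert_tokens_to_ids("<extra_id_0>"))
```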
However, when I follow the Hugging Face Course to train T5's tokenizer on new data, the new tokenizer does not conform to this constraint.
Top GitHub Comments
As long as it’s done in examples that are focused on T5-only (e.g. not the generic `run_summarization`), no problem with me!

@patrickvonplaten @SaulLu I can pick this issue. What should be the approach we need to take?
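The fix being discussed amounts to not assuming the sentinel tokens sit at the end of the vocabulary in T5-specific examples. A minimal sketch of a position-independent lookup is shown below; the helper name `get_sentinel_token_ids` is hypothetical and not an existing transformers API:

```python
from transformers import AutoTokenizer

def get_sentinel_token_ids(tokenizer):
    """Look up <extra_id_*> IDs by token string instead of assuming they
    occupy the last slots of the vocabulary. Hypothetical helper."""
    sentinel_tokens = [
        tok for tok in tokenizer.additional_special_tokens
        if tok.startswith("<extra_id_")
    ]
    return tokenizer.convert_tokens_to_ids(sentinel_tokens)

tokenizer = AutoTokenizer.from_pretrained("t5-large")
print(get_sentinel_token_ids(tokenizer)[:5])
```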