How to tokenize a big dataset
Based on the examples, I am trying to train a tokenizer and a model for T5. I am using Google Colab Pro, and I tried to run the following code:
```python
import datasets
from t5_tokenizer_model import SentencePieceUnigramTokenizer

vocab_size = 32_000
input_sentence_size = None  # changing this to 100_000 works

# Initialize a dataset
dataset = datasets.load_dataset("oscar", name="unshuffled_deduplicated_fa", split="train")
tokenizer = SentencePieceUnigramTokenizer(unk_token="<unk>", eos_token="</s>", pad_token="<pad>")

print("len dataset:", len(dataset))

# Build an iterator over this dataset
def batch_iterator(input_sentence_size=None):
    if input_sentence_size is None:
        input_sentence_size = len(dataset)
    batch_length = 100
    for i in range(0, input_sentence_size, batch_length):
        yield dataset[i: i + batch_length]["text"]

# Train the tokenizer
tokenizer.train_from_iterator(
    iterator=batch_iterator(input_sentence_size=input_sentence_size),
    vocab_size=vocab_size,
    show_progress=True,
)

# Save files to disk
tokenizer.save("/content/drive/MyDrive/Pouramini/tokenizer.json")
```
It gets stuck in `train_from_iterator` because the dataset is large (`input_sentence_size` is around 8M sentences).
How can I divide the dataset, run the code on each block, and then merge the results into a single tokenizer output?
Issue Analytics
- Created: 2 years ago
- Comments: 6 (4 by maintainers)
No, the Datasets library never loads the samples unless you request them; it uses Apache Arrow behind the scenes (you can read more in the documentation). Using the batch iterator as you did will never load the full dataset into memory.
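For anyone who wants to verify this, here is a minimal sketch (the `psutil` measurement is my own addition, not part of this thread) showing that slicing an Arrow-backed dataset materializes only the requested rows, so resident memory stays far below the corpus size:

```python
# Sketch: measure how little RAM a slice of a memory-mapped dataset actually uses.
# psutil is an extra dependency used here only for illustration.
import os

import datasets
import psutil

dataset = datasets.load_dataset(
    "oscar", name="unshuffled_deduplicated_fa", split="train"
)
process = psutil.Process(os.getpid())

rss_before = process.memory_info().rss
batch = dataset[0:100]["text"]  # only these 100 rows are read from disk
rss_after = process.memory_info().rss

print(f"RSS grew by ~{(rss_after - rss_before) / 1e6:.1f} MB for a 100-row slice")
```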
This is exactly what is done in the code sample above as well. I'm not too sure I understand what the feature request here is: the training does not get stuck, it just takes a long time to finish. There are no progress bars in notebooks, which is a feature you can request on Tokenizers.
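If the full 8M-sentence run takes longer than you can afford, one option (already hinted at by the `input_sentence_size = None # changing this to 100_000 works` comment in the original snippet) is to train on a random subset. A rough sketch, reusing `dataset`, `tokenizer`, and `vocab_size` from the snippet above, with the subset size chosen arbitrarily:

```python
# Sketch: train on a shuffled subset to cut wall-clock time.
# The 1_000_000 figure is an arbitrary example, not a recommendation from this thread.
subset = dataset.shuffle(seed=42).select(range(1_000_000))

def subset_batch_iterator(batch_length=100):
    # Yields batches of raw text from the subset only.
    for i in range(0, len(subset), batch_length):
        yield subset[i : i + batch_length]["text"]

tokenizer.train_from_iterator(
    iterator=subset_batch_iterator(),
    vocab_size=vocab_size,
    show_progress=True,
)
```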
At some point, training a tokenizer on such a large dataset in Colab is counter-productive; that environment is not appropriate for CPU-intensive work like this. You should spin up a CPU instance (those are very cheap) to train your tokenizer, then upload the result to the Hub to re-use it once you are ready to train.
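A sketch of the "upload to the Hub and re-use it" step, assuming the `tokenizer.json` produced above and a placeholder repository name of your choosing (you need to be logged in via `huggingface-cli login` or pass a token):

```python
# Sketch: wrap the trained tokenizer.json in a fast tokenizer and push it to the Hub.
from transformers import AutoTokenizer, PreTrainedTokenizerFast

fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="tokenizer.json",
    unk_token="<unk>",
    eos_token="</s>",
    pad_token="<pad>",
)

# "your-username/t5-fa-tokenizer" is a placeholder repository name.
fast_tokenizer.push_to_hub("your-username/t5-fa-tokenizer")

# Later, on the machine where you actually train the model:
tokenizer = AutoTokenizer.from_pretrained("your-username/t5-fa-tokenizer")
```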