
How to tokenize big dataset


Based on the examples, I am trying to train a tokenizer and a model for T5 on Google Colab Pro. When I run the following code:

import datasets

from t5_tokenizer_model import SentencePieceUnigramTokenizer


vocab_size = 32_000
input_sentence_size = None  # changing this to 100_000 works

# Initialize a dataset
dataset = datasets.load_dataset("oscar", name="unshuffled_deduplicated_fa", split="train")

tokenizer = SentencePieceUnigramTokenizer(unk_token="<unk>", eos_token="</s>", pad_token="<pad>")

print("len dataset:", len(dataset))

# Build an iterator over this dataset
def batch_iterator(input_sentence_size=None):
    if input_sentence_size is None:
        input_sentence_size = len(dataset)
    batch_length = 100
    for i in range(0, input_sentence_size, batch_length):
        yield dataset[i: i + batch_length]["text"]


# Train tokenizer
tokenizer.train_from_iterator(
    iterator=batch_iterator(input_sentence_size=input_sentence_size),
    vocab_size=vocab_size,
    show_progress=True,
)

# Save files to disk
tokenizer.save("/content/drive/MyDrive/Pouramini/tokenizer.json")

It gets stuck in train_from_iterator because the dataset is large (input_sentence_size is around 8M sentences). How can I divide the dataset into blocks, run the code on each block, and then merge the results into a single tokenizer output?
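One possible workaround (a sketch using the Datasets API, not something from the original post or the answers below) is to skip splitting and merging entirely and instead train on a representative random subset of the corpus; shuffle and select only build an index mapping over the Arrow-backed dataset, so the subset is not loaded into memory either:

# Sketch: train on a 1M-sentence random subset instead of splitting/merging.
subset = dataset.shuffle(seed=42).select(range(1_000_000))

def subset_batch_iterator(batch_length=100):
    for i in range(0, len(subset), batch_length):
        yield subset[i: i + batch_length]["text"]

tokenizer.train_from_iterator(
    iterator=subset_batch_iterator(),
    vocab_size=vocab_size,
    show_progress=True,
)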

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

1 reaction
sgugger commented, Oct 5, 2021

No, the Datasets library never loads the samples unless you request them; it uses Apache Arrow behind the scenes (you can read more in the documentation). Using the batch iterator as you did will never load the full dataset in memory.
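To see this for yourself (an illustrative sketch, not part of the original answer; it assumes psutil is available, which it normally is on Colab), you can watch the process's resident memory stay roughly flat while slicing batches, since each slice only materializes the requested rows from the memory-mapped Arrow file:

import psutil

process = psutil.Process()
print(f"RSS before: {process.memory_info().rss / 1e6:.0f} MB")

# Pull a few thousand rows in batches; only the requested rows are materialized.
for i in range(0, 10_000, 100):
    _ = dataset[i: i + 100]["text"]

print(f"RSS after: {process.memory_info().rss / 1e6:.0f} MB")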

1 reaction
sgugger commented, Oct 5, 2021

This is exactly what is done in the code sample above as well. I’m not too sure I understand what the feature request here is: the training does not get stuck, it just takes a long time to finish. There are no progress bars in notebooks, which is a feature you can request on Tokenizers.

At some point, training a tokenizer on such a large dataset in Colab is counter-productive; that environment is not appropriate for CPU-intensive work like this. You should spin up a CPU instance (those are very cheap) to train your tokenizer, then upload the result to the Hub so you can re-use it once you are ready to train.
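A minimal sketch of that last step, re-loading the saved tokenizer.json and pushing it to the Hub (the repository name below is a placeholder, and push_to_hub assumes you have already logged in with huggingface-cli login):

from transformers import PreTrainedTokenizerFast

# Wrap the trained tokenizers-library file in a transformers fast tokenizer.
fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="tokenizer.json",
    unk_token="<unk>",
    eos_token="</s>",
    pad_token="<pad>",
)

# Placeholder repo name; requires being logged in to the Hugging Face Hub.
fast_tokenizer.push_to_hub("your-username/t5-fa-tokenizer")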


