How to tokenize a big dataset
Based on the examples, I am trying to train a tokenizer and a model for T5. I am using Google Colab Pro, and I tried to run the following code:
```python
import datasets
from t5_tokenizer_model import SentencePieceUnigramTokenizer

vocab_size = 32_000
input_sentence_size = None  # changing this to 100_000 works

# Initialize a dataset
dataset = datasets.load_dataset("oscar", name="unshuffled_deduplicated_fa", split="train")
tokenizer = SentencePieceUnigramTokenizer(unk_token="<unk>", eos_token="</s>", pad_token="<pad>")

print("len dataset:", len(dataset))

# Build an iterator over this dataset
def batch_iterator(input_sentence_size=None):
    if input_sentence_size is None:
        input_sentence_size = len(dataset)
    batch_length = 100
    for i in range(0, input_sentence_size, batch_length):
        yield dataset[i: i + batch_length]["text"]

# Train the tokenizer
tokenizer.train_from_iterator(
    iterator=batch_iterator(input_sentence_size=input_sentence_size),
    vocab_size=vocab_size,
    show_progress=True,
)

# Save files to disk
tokenizer.save("/content/drive/MyDrive/Pouramini/tokenizer.json")
```
It gets stuck in `train_from_iterator` because the dataset is large (`input_sentence_size` is around 8M sentences).
How can I divide the dataset, run the code on each block, and then merge the results into a single tokenizer output?
Issue Analytics
- Created: 2 years ago
- Comments: 6 (4 by maintainers)
No, the Datasets library never loads the samples unless you request them; it uses Apache Arrow behind the scenes (you can read more in the documentation). Using the batch iterator as you did will never load the full dataset into memory.
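For anyone who wants to verify this, here is a minimal sketch (the `psutil` measurement is my own addition, not part of this thread) showing that slicing an Arrow-backed dataset materializes only the requested rows, so resident memory stays far below the corpus size:

```python
# Sketch: measure how little RAM a slice of a memory-mapped dataset actually uses.
# psutil is an extra dependency used here only for illustration.
import os

import datasets
import psutil

dataset = datasets.load_dataset(
    "oscar", name="unshuffled_deduplicated_fa", split="train"
)
process = psutil.Process(os.getpid())

rss_before = process.memory_info().rss
batch = dataset[0:100]["text"]  # only these 100 rows are read from disk
rss_after = process.memory_info().rss

print(f"RSS grew by ~{(rss_after - rss_before) / 1e6:.1f} MB for a 100-row slice")
```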
This is exactly what is done in the code sample above as well. I'm not too sure I understand what the feature request here is: the training does not get stuck, it just takes a long time to finish. There are no progress bars in notebooks, which is a feature you can request on Tokenizers.
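If the full 8M-sentence run takes longer than you can afford, one option (already hinted at by the `input_sentence_size = None # changing this to 100_000 works` comment in the original snippet) is to train on a random subset. A rough sketch, reusing `dataset`, `tokenizer`, and `vocab_size` from the snippet above, with the subset size chosen arbitrarily:

```python
# Sketch: train on a shuffled subset to cut wall-clock time.
# The 1_000_000 figure is an arbitrary example, not a recommendation from this thread.
subset = dataset.shuffle(seed=42).select(range(1_000_000))

def subset_batch_iterator(batch_length=100):
    # Yields batches of raw text from the subset only.
    for i in range(0, len(subset), batch_length):
        yield subset[i : i + batch_length]["text"]

tokenizer.train_from_iterator(
    iterator=subset_batch_iterator(),
    vocab_size=vocab_size,
    show_progress=True,
)
```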
At some point, training a tokenizer on such a large dataset in Colab is counter-productive; that environment is not appropriate for CPU-intensive work like this. You should spin up a CPU instance (those are very cheap) to train your tokenizer, then upload the result to the Hub to re-use it once you are ready to train.
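A sketch of the "upload to the Hub and re-use it" step, assuming the `tokenizer.json` produced above and a placeholder repository name of your choosing (you need to be logged in via `huggingface-cli login` or pass a token):

```python
# Sketch: wrap the trained tokenizer.json in a fast tokenizer and push it to the Hub.
from transformers import AutoTokenizer, PreTrainedTokenizerFast

fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="tokenizer.json",
    unk_token="<unk>",
    eos_token="</s>",
    pad_token="<pad>",
)

# "your-username/t5-fa-tokenizer" is a placeholder repository name.
fast_tokenizer.push_to_hub("your-username/t5-fa-tokenizer")

# Later, on the machine where you actually train the model:
tokenizer = AutoTokenizer.from_pretrained("your-username/t5-fa-tokenizer")
```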