Possible Bug: Small training/dataset file creates gigantic output
Hey guys,
I was trying to create a new BERT model from scratch via huggingface transformers + tokenizers + datasets (actually using this example script by your team: https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_mlm.py). It was supposed to be a first test with a small 5 GB raw text file, but I can't even finish the preprocessing handled by datasets because this tiny 5 GB text file grows to more than 1 TB during processing. My system ran out of space and crashed prematurely.
I've done training from scratch via Google's BERT repo in the past, and I remember that the resulting pretraining data can get quite big, but 5 GB turning into 1 TB was never the case. Is this considered normal or is it a bug?
I've used the following command:
python xla_spawn.py --num_cores=8 run_mlm.py --model_type bert --config_name config.json --tokenizer_name tokenizer.json --train_file dataset_full.txt --do_train --output_dir out --max_steps 500000 --save_steps 2500 --save_total_limit 2 --prediction_loss_only --line_by_line --max_seq_length 128 --pad_to_max_length --preprocessing_num_workers 16 --per_device_train_batch_size 128 --overwrite_output_dir --debug
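For what it's worth, here is a rough back-of-the-envelope sketch of how the numbers could get that big; the average line length, the number of integer columns, and the int64 storage size are my assumptions, not measured values:

```python
# Rough estimate only. Assumptions (mine, not measured): ~100 bytes per raw
# text line, four integer columns per example (input_ids, attention_mask,
# token_type_ids, special_tokens_mask), each value stored as a 64-bit integer.

raw_bytes = 5 * 1024**3            # the ~5 GB raw text file
avg_line_bytes = 100               # assumed average line length
num_examples = raw_bytes // avg_line_bytes  # --line_by_line: one example per line

max_seq_length = 128               # --max_seq_length 128 --pad_to_max_length
num_int_columns = 4                # assumed number of columns written per example
bytes_per_value = 8                # assumed int64 storage in the Arrow cache

bytes_per_example = max_seq_length * num_int_columns * bytes_per_value
single_cache = num_examples * bytes_per_example
all_core_caches = single_cache * 8  # xla_spawn.py --num_cores=8, one cache each

print(f"one preprocessed cache : ~{single_cache / 1024**4:.2f} TB")
print(f"eight per-core caches  : ~{all_core_caches / 1024**4:.2f} TB")
```

Under these assumptions a single preprocessed cache is already around 40x the raw text, and eight independent per-core caches land in the 1-2 TB range I'm seeing.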
Hi @NebelAI, we have optimized Datasets' disk usage in the latest release, v1.5. Feel free to update your Datasets version and see if it better suits your needs.
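As a quick sanity check before re-running the preprocessing, you can verify which version is installed (this snippet assumes the packaging package is available; the 1.5.0 threshold simply mirrors the release mentioned above):

```python
# Check that the installed datasets release includes the disk-usage
# optimizations mentioned above (v1.5 or newer).
import datasets
from packaging import version

if version.parse(datasets.__version__) < version.parse("1.5.0"):
    raise RuntimeError(
        f"datasets {datasets.__version__} predates v1.5; "
        "upgrade with `pip install -U datasets` before re-running preprocessing."
    )
print(f"datasets {datasets.__version__} is recent enough.")
```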
It's actually because of the preprocessing_num_workers parameter when using TPU. I am also planning to train my model on a Google TPU with an 11 GB text corpus. With 8 cores enabled, each TPU core gets its own copy of the preprocessed dataset. When not using distributed training, the preprocessed file is about 77 GB. With XLA enabled, on the other hand, the files produced easily consume all my free space (more than 220 GB already; I think it will end up around 600 GB). So I think that may be where the problem comes from.
Is there any possibility that all of the cores could share the same preprocessed dataset?
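In case it is useful, here is a rough sketch of the workaround I have in mind (my own idea, not something run_mlm.py supports out of the box): tokenize once in a single process, save the Arrow files to disk, and have every spawned core load the already-processed dataset instead of re-running .map(). The output path below is a placeholder; the tokenizer file, data file, and sequence length just mirror the command above.

```python
# Hypothetical workaround sketch: preprocess once, then let every TPU core
# reuse the same on-disk Arrow files instead of re-tokenizing per process.
from datasets import load_dataset, load_from_disk
from transformers import PreTrainedTokenizerFast

PROCESSED_DIR = "processed_dataset"  # placeholder: shared path all cores can read


def preprocess_once():
    """Run in a single process before spawning the TPU cores."""
    tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")
    if tokenizer.pad_token is None:
        tokenizer.pad_token = "[PAD]"  # assumes [PAD] is in the trained vocab

    def tokenize_function(batch):
        return tokenizer(
            batch["text"],
            truncation=True,
            padding="max_length",
            max_length=128,
        )

    raw = load_dataset("text", data_files="dataset_full.txt", split="train")
    tokenized = raw.map(
        tokenize_function,
        batched=True,
        num_proc=16,
        remove_columns=["text"],
    )
    tokenized.save_to_disk(PROCESSED_DIR)  # written once


def load_shared():
    """Called by each spawned core instead of re-running the .map() above."""
    # load_from_disk memory-maps the Arrow files, so the data is not duplicated
    # in memory and no per-core cache needs to be written.
    return load_from_disk(PROCESSED_DIR)
```

That way only one preprocessed copy ever hits the disk, and the per-core processes just read it.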
@sgugger @RammMaschine