Possible Bug: Small training/dataset file creates gigantic output
Hey guys,
I was trying to create a new BERT model from scratch via huggingface transformers + tokenizers + datasets (actually using this example script by your team: https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_mlm.py). It was supposed to be a first test with a small 5 GB raw text file, but I can't even finish the preprocessing handled by datasets because this tiny 5 GB text file grows to more than 1 TB during processing. My system ran out of space and crashed prematurely.
I've done training from scratch via Google's BERT repo in the past, and I remember that the resulting pretraining data can get quite big, but 5 GB turning into 1 TB was never the case. Is this considered normal or is it a bug?
I've used the following command:
python xla_spawn.py --num_cores=8 run_mlm.py --model_type bert --config_name config.json --tokenizer_name tokenizer.json --train_file dataset_full.txt --do_train --output_dir out --max_steps 500000 --save_steps 2500 --save_total_limit 2 --prediction_loss_only --line_by_line --max_seq_length 128 --pad_to_max_length --preprocessing_num_workers 16 --per_device_train_batch_size 128 --overwrite_output_dir --debug
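For what it's worth, here is a rough back-of-the-envelope sketch of how the numbers could get that big; the average line length, the number of integer columns, and the int64 storage size are my assumptions, not measured values:

```python
# Rough estimate only. Assumptions (mine, not measured): ~100 bytes per raw
# text line, four integer columns per example (input_ids, attention_mask,
# token_type_ids, special_tokens_mask), each value stored as a 64-bit integer.

raw_bytes = 5 * 1024**3            # the ~5 GB raw text file
avg_line_bytes = 100               # assumed average line length
num_examples = raw_bytes // avg_line_bytes  # --line_by_line: one example per line

max_seq_length = 128               # --max_seq_length 128 --pad_to_max_length
num_int_columns = 4                # assumed number of columns written per example
bytes_per_value = 8                # assumed int64 storage in the Arrow cache

bytes_per_example = max_seq_length * num_int_columns * bytes_per_value
single_cache = num_examples * bytes_per_example
all_core_caches = single_cache * 8  # xla_spawn.py --num_cores=8, one cache each

print(f"one preprocessed cache : ~{single_cache / 1024**4:.2f} TB")
print(f"eight per-core caches  : ~{all_core_caches / 1024**4:.2f} TB")
```

Under these assumptions a single preprocessed cache is already around 40x the raw text, and eight independent per-core caches land in the 1-2 TB range I'm seeing.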
Hi @NebelAI, we have optimized Datasets' disk usage in the latest release, v1.5. Feel free to update your Datasets version and see if it better suits your needs.
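As a quick sanity check before re-running the preprocessing, you can verify which version is installed (this snippet assumes the packaging package is available; the 1.5.0 threshold simply mirrors the release mentioned above):

```python
# Check that the installed datasets release includes the disk-usage
# optimizations mentioned above (v1.5 or newer).
import datasets
from packaging import version

if version.parse(datasets.__version__) < version.parse("1.5.0"):
    raise RuntimeError(
        f"datasets {datasets.__version__} predates v1.5; "
        "upgrade with `pip install -U datasets` before re-running preprocessing."
    )
print(f"datasets {datasets.__version__} is recent enough.")
```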
It's actually because of the preprocessing_num_workers parameter when using TPU. I am also planning to train my model on a Google TPU with an 11 GB text corpus. With 8 cores enabled, each TPU core gets its own copy of the preprocessed dataset. When not using distributed training, the preprocessed file is about 77 GB. With XLA enabled, on the other hand, the files produced easily consume all my free space (more than 220 GB already; I think it will end up around 600 GB). So I think that may be where the problem comes from.
Is there any possibility that all of the cores could share the same preprocessed dataset?
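In case it is useful, here is a rough sketch of the workaround I have in mind (my own idea, not something run_mlm.py supports out of the box): tokenize once in a single process, save the Arrow files to disk, and have every spawned core load the already-processed dataset instead of re-running .map(). The output path below is a placeholder; the tokenizer file, data file, and sequence length just mirror the command above.

```python
# Hypothetical workaround sketch: preprocess once, then let every TPU core
# reuse the same on-disk Arrow files instead of re-tokenizing per process.
from datasets import load_dataset, load_from_disk
from transformers import PreTrainedTokenizerFast

PROCESSED_DIR = "processed_dataset"  # placeholder: shared path all cores can read


def preprocess_once():
    """Run in a single process before spawning the TPU cores."""
    tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")
    if tokenizer.pad_token is None:
        tokenizer.pad_token = "[PAD]"  # assumes [PAD] is in the trained vocab

    def tokenize_function(batch):
        return tokenizer(
            batch["text"],
            truncation=True,
            padding="max_length",
            max_length=128,
        )

    raw = load_dataset("text", data_files="dataset_full.txt", split="train")
    tokenized = raw.map(
        tokenize_function,
        batched=True,
        num_proc=16,
        remove_columns=["text"],
    )
    tokenized.save_to_disk(PROCESSED_DIR)  # written once


def load_shared():
    """Called by each spawned core instead of re-running the .map() above."""
    # load_from_disk memory-maps the Arrow files, so the data is not duplicated
    # in memory and no per-core cache needs to be written.
    return load_from_disk(PROCESSED_DIR)
```

That way only one preprocessed copy ever hits the disk, and the per-core processes just read it.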
@sgugger @RammMaschine