1.3GB dataset creates over 107GB of cache files!
Environment info
- `transformers` version: 4.4.0.dev0
- Platform: Google Colab
- Python version: 3.6
- PyTorch version (GPU?): 1.7
- Tensorflow version (GPU?): None
- Using GPU in script?: None. Colab TPU is used
- Using distributed or parallel set-up in script?: Using the default `run_mlm.py` script
Who can help
Information
Model I am using (Bert, XLNet …): DistilBert
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQUaD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior:
!python /content/transformers/examples/xla_spawn.py --num_cores 8 \
  /content/transformers/examples/language-modeling/run_mlm.py \
  --model_type distilbert \
  --config_name /content/TokenizerFiles \
  --tokenizer_name /content/drive/TokenizerFiles \
  --train_file Corpus.txt \
  --mlm_probability 0.15 \
  --output_dir "/content/TrainingCheckpoints" \
  --do_train \
  --per_device_train_batch_size 32 \
  --save_steps 500 \
  --disable_tqdm False \
  --line_by_line True \
  --max_seq_length 128 \
  --pad_to_max_length True \
  --cache_dir /content/cache_dir \
  --save_total_limit 2
The script ends up creating more than 107GB of cache files with only 54% of the preprocessing done, which crashes the Colab environment. This means that 200+ GB of space is required to cache and preprocess a mere 1GB file. Am I doing something wrong here? I ran the same script a few days ago and it didn't give me any such "out of disk space" error. Because I wanted to use the TPU, I changed `pad_to_max_length=True` (#10192). That's all I changed, and it now does this. Let me know if anyone requires any more data to help me out with this.
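For context, here is a back-of-envelope estimate (purely illustrative; the line count is an assumption, not a figure from this issue) of why padding every line to `max_seq_length` and materialising the tokenized columns in the on-disk cache can inflate a ~1GB text corpus by two orders of magnitude:

```python
# Rough estimate of cache size when every line is padded to max_seq_length
# and the tokenized columns are written to disk.
max_seq_length = 128        # --max_seq_length from the command above
bytes_per_token = 8         # assuming token ids are stored as 64-bit integers
num_columns = 3             # e.g. input_ids, attention_mask, special_tokens_mask
num_lines = 25_000_000      # hypothetical line count for a ~1GB line-by-line corpus

cache_size_gb = max_seq_length * bytes_per_token * num_columns * num_lines / 1e9
print(f"~{cache_size_gb:.0f} GB of cache")   # ~77 GB under these assumptions
```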
Expected behavior
The dataset should cache in a minimal amount of disk space. It currently occupies over 150-200x the space of the actual dataset.
Issue Analytics
- Created 3 years ago
- Reactions: 3
- Comments: 15 (8 by maintainers)
Top GitHub Comments
`Trainer` in master completely supports `set_transform`. If there are some columns removed that should not be, you just have to set the training argument `remove_unused_columns` to `False` for the time being.

Returning the dataset is more intuitive, I feel. Anyway, this is some really good news. I will try to modify the script and make it work. If it does, then maybe, if you want, I can clean up the code and create a pull request for the same.