1.3GB dataset creates over 107GB of cache files!
Environment info
- `transformers` version: 4.4.0.dev0
- Platform: Google Colab
- Python version: 3.6
- PyTorch version (GPU?): 1.7
- Tensorflow version (GPU?): None
- Using GPU in script?: None. Colab TPU is used
- Using distributed or parallel set-up in script?: Using the default `run_mlm.py` script
Who can help
Information
Model I am using (Bert, XLNet …): DistilBert
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQUaD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior:
!python /content/transformers/examples/xla_spawn.py --num_cores 8 \
  /content/transformers/examples/language-modeling/run_mlm.py \
  --model_type distilbert \
  --config_name /content/TokenizerFiles \
  --tokenizer_name /content/drive/TokenizerFiles \
  --train_file Corpus.txt \
  --mlm_probability 0.15 \
  --output_dir "/content/TrainingCheckpoints" \
  --do_train \
  --per_device_train_batch_size 32 \
  --save_steps 500 \
  --disable_tqdm False \
  --line_by_line True \
  --max_seq_length 128 \
  --pad_to_max_length True \
  --cache_dir /content/cache_dir \
  --save_total_limit 2
The script ends up creating more than 107GB of cache files with only 54% of the preprocessing done, which crashes the Colab environment. This means that 200+ GB of space is required to cache and preprocess a mere 1GB file. Am I doing something wrong here? I ran the same script a few days ago and it didn't give me any such "out of disk space" error. Because I wanted to use the TPU, I changed `pad_to_max_length=True` (#10192). That's all I changed, and it now does this. Let me know if anyone requires any more data to help me out with this.
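For context, here is a back-of-envelope estimate (purely illustrative; the line count is an assumption, not a figure from this issue) of why padding every line to `max_seq_length` and materialising the tokenized columns in the on-disk cache can inflate a ~1GB text corpus by two orders of magnitude:

```python
# Rough estimate of cache size when every line is padded to max_seq_length
# and the tokenized columns are written to disk.
max_seq_length = 128        # --max_seq_length from the command above
bytes_per_token = 8         # assuming token ids are stored as 64-bit integers
num_columns = 3             # e.g. input_ids, attention_mask, special_tokens_mask
num_lines = 25_000_000      # hypothetical line count for a ~1GB line-by-line corpus

cache_size_gb = max_seq_length * bytes_per_token * num_columns * num_lines / 1e9
print(f"~{cache_size_gb:.0f} GB of cache")   # ~77 GB under these assumptions
```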
Expected behavior
The dataset should cache in a minimal amount of disk space. It currently occupies over 150-200x the space of the actual dataset.
Issue Analytics
- Created 3 years ago
- Reactions: 3
- Comments: 15 (8 by maintainers)
Top GitHub Comments
`Trainer` in master completely supports `set_transform`. If there are some columns removed that should not be, you just have to set the training argument `remove_unused_columns` to `False` for the time being.

Returning the dataset is more intuitive, I feel. Anyway, this is some really good news. I will try to modify the script and make it work. If it does, then maybe, if you want, I can clean up the code and create a pull request for the same.