[Trainer] Push to hub takes too much space for local `.git` folder
Environment info
- transformers version: 4.12.0.dev0
- Platform: Linux-5.3.0-64-generic-x86_64-with-glibc2.10
- Python version: 3.8.10
- PyTorch version (GPU?): 1.8.1 (True)
- Tensorflow version (GPU?): 2.6.0 (False)
- Flax version (CPU?/GPU?/TPU?): 0.3.4 (cpu)
- Jax version: 0.2.19
- JaxLib version: 0.1.70
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: No
Information
When using the --push_to_hub functionality, each commit writes the checkpoint data to the local .git folder, thus increasing the used disk space significantly. For a model of the size of bert-base-cased, the .git folder quickly grows to 10 GB during a short & simple training run. This can quickly lead to hard disk errors.
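A quick way to see this growth locally (a sketch; ./my_model is a placeholder for the Trainer's output_dir, i.e. the local clone of the hub repo):

# total size of the hidden .git folder inside the output directory
du -sh ./my_model/.git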
Expected behavior
Every time push_to_hub(...) is called, it should be ensured that the .git folder is “cleaned” so that it doesn’t keep the checkpoints of all previous commits.
To reproduce
Do the following:
- $ mkdir test && cd test
- $ ln -s /path/to/transformers/examples/pytorch/speech-recognition/run_speech_recognition_ctc.py ./
- create a run_dummy.sh file with the following code:
CUDA_VISIBLE_DEVICES="0" python run_speech_recognition_ctc.py \
--dataset_name="timit_asr" \
--model_name_or_path="patrickvonplaten/wav2vec2-base-repro-960h-libri-85k-steps" \
--overwrite_output_dir \
--output_dir="./dummy_run" \
--train_split_name="train" \
--num_train_epochs="1" \
--per_device_train_batch_size="8" \
--per_device_eval_batch_size="1" \
--weight_decay="0.005" \
--learning_rate="1e-4" \
--text_column_name="text" \
--save_steps="10" \
--logging_steps="1" \
--layerdrop="0.0" \
--save_total_limit="3" \
--freeze_feature_extractor \
--fp16 \
--push_to_hub \
--do_train
- Run bash run_dummy.sh
Now, depending on your upload speed, multiple model weights will be uploaded to the repository on the hub. However, for each commit the model checkpoint is also saved locally in the .git folder. E.g. running the above steps gives me this automatically created repo on the hub: https://huggingface.co/patrickvonplaten/dummy_run/commits/main . As you can see, there are 7 commits, 6 of which upload model checkpoints. My local .git folder now has a size of 2.1 GB, containing exactly 6 objects under .git/lfs/objects, each the size of one model checkpoint (360 MB). So every commit is written to the hard disk.
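To verify this locally, the cached LFS objects and their sizes can be listed directly (a sketch, assuming the repo was cloned into ./dummy_run as above):

# total size of the locally cached LFS objects
du -sh ./dummy_run/.git/lfs/objects
# list each cached object with its size (one ~360 MB file per pushed checkpoint)
find ./dummy_run/.git/lfs/objects -type f -exec du -h {} +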
This can be very problematic for people with limited disk space, as the .git folder just accumulates saved checkpoints. It essentially makes large model pretraining impossible. I think we should make sure that after each commit the .git folder is cleaned so that it is essentially empty.
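Until that happens, a manual workaround is to run git lfs prune inside the local clone after pushing; it removes local copies of LFS files that are old, i.e. not needed by the current checkout or unpushed commits (a sketch, again assuming the clone is ./dummy_run):

cd ./dummy_run
# show what would be freed without deleting anything
git lfs prune --dry-run
# delete old, already-pushed LFS objects from .git/lfs/objects
git lfs prune
du -sh .git/lfs/objects

The objects referenced by the current checkout are kept, so the working tree and the repo on the hub are unaffected.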
Issue Analytics: Created 2 years ago, Comments: 9 (9 by maintainers)
Could you give #14294 a try?
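One way to test that PR locally is to fetch its head via GitHub's pull refs and install transformers from source (a sketch; pr-14294 is just a local branch name chosen here, and origin is assumed to point at huggingface/transformers):

cd /path/to/transformers
# fetch the PR head into a local branch and switch to it
git fetch origin pull/14294/head:pr-14294
git checkout pr-14294
# install this checkout in editable mode
pip install -e .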
We can add it as a method, and I’ll see if it makes sense to also add it as a keyword argument for certain methods, like git_push or the commit context manager. Will work on a PR today or tomorrow.