[Trainer] Push to hub takes too much space for local `.git` folder
Environment info
- transformers version: 4.12.0.dev0
- Platform: Linux-5.3.0-64-generic-x86_64-with-glibc2.10
- Python version: 3.8.10
- PyTorch version (GPU?): 1.8.1 (True)
- Tensorflow version (GPU?): 2.6.0 (False)
- Flax version (CPU?/GPU?/TPU?): 0.3.4 (cpu)
- Jax version: 0.2.19
- JaxLib version: 0.1.70
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: No
Information
When using the --push_to_hub functionality, each commit writes the checkpoint data to the local .git folder, thus increasing the used disk space significantly. For a model of the size of bert-base-cased, the .git folder quickly grows to 10 GB during a short & simple training run. This can quickly lead to hard disk errors.
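A quick way to see this growth locally (a sketch; ./my_model is a placeholder for the Trainer's output_dir, i.e. the local clone of the hub repo):

# total size of the hidden .git folder inside the output directory
du -sh ./my_model/.git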
Expected behavior
Every time push_to_hub(...) is called, it should be ensured that the .git folder is “cleaned” so that it doesn’t keep the checkpoints of all previous commits.
To reproduce
Do the following:
- $ mkdir test && cd test
- $ ln -s /path/to/transformers/examples/pytorch/speech-recognition/run_speech_recognition_ctc.py ./
- create a run_dummy.sh file with the following code:
CUDA_VISIBLE_DEVICES="0" python run_speech_recognition_ctc.py \
--dataset_name="timit_asr" \
--model_name_or_path="patrickvonplaten/wav2vec2-base-repro-960h-libri-85k-steps" \
--overwrite_output_dir \
--output_dir="./dummy_run" \
--train_split_name="train" \
--num_train_epochs="1" \
--per_device_train_batch_size="8" \
--per_device_eval_batch_size="1" \
--weight_decay="0.005" \
--learning_rate="1e-4" \
--text_column_name="text" \
--save_steps="10" \
--logging_steps="1" \
--layerdrop="0.0" \
--save_total_limit="3" \
--freeze_feature_extractor \
--fp16 \
--push_to_hub \
--do_train
- Run bash run_dummy.sh
Now, depending on your upload speed, multiple model weights will be uploaded to the repository on the hub. However, for each commit the model checkpoint is also saved locally in the .git folder. E.g. running the above steps gives me this automatically created repo on the hub: https://huggingface.co/patrickvonplaten/dummy_run/commits/main . As you can see, there are 7 commits, 6 of which upload model checkpoints. My local .git folder now has a size of 2.1 GB, containing exactly 6 objects under .git/lfs/objects, each the size of one model checkpoint (360 MB). So every commit is written to the hard disk.
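To verify this locally, the cached LFS objects and their sizes can be listed directly (a sketch, assuming the repo was cloned into ./dummy_run as above):

# total size of the locally cached LFS objects
du -sh ./dummy_run/.git/lfs/objects
# list each cached object with its size (one ~360 MB file per pushed checkpoint)
find ./dummy_run/.git/lfs/objects -type f -exec du -h {} +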
This can be very problematic for people with limited disk space, as the .git folder just accumulates saved checkpoints. It essentially makes large model pretraining impossible. I think we should make sure that after each commit the .git folder is cleaned so that it is essentially empty.
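Until that happens, a manual workaround is to run git lfs prune inside the local clone after pushing; it removes local copies of LFS files that are old, i.e. not needed by the current checkout or unpushed commits (a sketch, again assuming the clone is ./dummy_run):

cd ./dummy_run
# show what would be freed without deleting anything
git lfs prune --dry-run
# delete old, already-pushed LFS objects from .git/lfs/objects
git lfs prune
du -sh .git/lfs/objects

The objects referenced by the current checkout are kept, so the working tree and the repo on the hub are unaffected.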
Issue Analytics: Created 2 years ago, Comments: 9 (9 by maintainers)
Could you give #14294 a try?
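One way to test that PR locally is to fetch its head via GitHub's pull refs and install transformers from source (a sketch; pr-14294 is just a local branch name chosen here, and origin is assumed to point at huggingface/transformers):

cd /path/to/transformers
# fetch the PR head into a local branch and switch to it
git fetch origin pull/14294/head:pr-14294
git checkout pr-14294
# install this checkout in editable mode
pip install -e .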
We can add it as a method, and I’ll see if it makes sense to also add it as a keyword argument for certain methods, like git_push or the commit context manager. Will work on a PR today or tomorrow.