[Trainer] Push to hub takes too much space for local `.git` folder

Environment info

  • transformers version: 4.12.0.dev0
  • Platform: Linux-5.3.0-64-generic-x86_64-with-glibc2.10
  • Python version: 3.8.10
  • PyTorch version (GPU?): 1.8.1 (True)
  • Tensorflow version (GPU?): 2.6.0 (False)
  • Flax version (CPU?/GPU?/TPU?): 0.3.4 (cpu)
  • Jax version: 0.2.19
  • JaxLib version: 0.1.70
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Information

When using the --push_to_hub functionality, each commit writes data to the local .git folder, which significantly increases the disk space used. For a model the size of bert-base-cased, the .git folder quickly grows to 10 GB even for a short, simple training run. This can quickly lead to disk space errors.

Expected behavior

Every time push_to_hub(...) is called, the .git folder should be “cleaned” so that it doesn’t contain the checkpoints of all previous commits.

To reproduce

Do the following:

  1. $ mkdir test
  2. $ ln -s path/to/transformers/examples/pytorch/speech-recognition/run_speech_recognition_ctc.py ./
  3. create a run_dummy.sh file with the following code:
CUDA_VISIBLE_DEVICES="0" python run_speech_recognition_ctc.py \
        --dataset_name="timit_asr" \
        --model_name_or_path="patrickvonplaten/wav2vec2-base-repro-960h-libri-85k-steps" \
        --overwrite_output_dir \
        --output_dir="./dummy_run" \
        --train_split_name="train" \
        --num_train_epochs="1" \
        --per_device_train_batch_size="8" \
        --per_device_eval_batch_size="1" \
        --weight_decay="0.005" \
        --learning_rate="1e-4" \
        --text_column_name="text" \
        --save_steps="10" \
        --logging_steps="1" \
        --layerdrop="0.0" \
        --save_total_limit="3" \
        --freeze_feature_extractor \
        --fp16 \
        --push_to_hub \
        --do_train
  4. Run bash run_dummy.sh

Now, depending on your upload speed, multiple model weights will be uploaded to the repository on the hub. However, for each commit the model checkpoint is also saved locally in the .git folder. For example, running the above steps gives me this automatically created repo on the hub: https://huggingface.co/patrickvonplaten/dummy_run/commits/main . As you can see, there are 7 commits, 6 of which upload model checkpoints. My local .git folder now has a size of 2.1 GB and contains exactly 6 objects in .git/lfs/objects, each the size of one model checkpoint (360 MB). This means that every commit is written to the hard disk.
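To check the numbers above on your own machine (6 checkpoints of about 360 MB each comes to roughly 2.1 GB), a small script along these lines could be used. It is only a sketch: the helper names `dir_size_bytes` and `lfs_objects_size_mb` are made up for illustration, and it assumes the cloned repo lives in the `--output_dir` from the script above.

```python
import os
from pathlib import Path
from typing import Union


def dir_size_bytes(path: Union[str, Path]) -> int:
    """Return the total size in bytes of all regular files under `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            # Skip symlinks so that linked files are not counted.
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def lfs_objects_size_mb(repo_dir: Union[str, Path]) -> float:
    """Size in MB of <repo_dir>/.git/lfs/objects, or 0.0 if it is missing."""
    lfs_dir = Path(repo_dir) / ".git" / "lfs" / "objects"
    if not lfs_dir.is_dir():
        return 0.0
    return dir_size_bytes(lfs_dir) / 1e6


if __name__ == "__main__":
    # "./dummy_run" is the --output_dir used in the reproduction script.
    size_mb = lfs_objects_size_mb("./dummy_run")
    print(f"{size_mb:.1f} MB in .git/lfs/objects")
```

Running it after each save step should show the folder growing by about one checkpoint per pushed commit.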

This can be very problematic for people with limited disk space, as the .git folder just accumulates saved checkpoints. It essentially makes it impossible to do large model pretraining. I think we should make sure that after each commit the .git folder is cleaned so that it’s essentially empty.
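As a manual workaround, one option might be to call `git lfs prune` in the output directory after a push. The sketch below is an assumption about how that could look, not what the Trainer does: `lfs_prune_command` and `prune_lfs_objects` are hypothetical helpers, and `git lfs prune` only removes local LFS objects that Git LFS itself considers safe to delete (for example, it keeps the files of the current checkout and of unpushed commits).

```python
import subprocess
from typing import List


def lfs_prune_command(verify_remote: bool = True, dry_run: bool = False) -> List[str]:
    """Build the `git lfs prune` command line (hypothetical helper)."""
    cmd = ["git", "lfs", "prune"]
    if verify_remote:
        # Ask the remote to confirm it has the objects before deleting them.
        cmd.append("--verify-remote")
    if dry_run:
        # Only report what would be deleted.
        cmd.append("--dry-run")
    return cmd


def prune_lfs_objects(repo_dir: str, verify_remote: bool = True, dry_run: bool = False) -> str:
    """Run `git lfs prune` inside `repo_dir` and return its stdout.

    Requires git and git-lfs to be installed; raises CalledProcessError
    if the command fails.
    """
    result = subprocess.run(
        lfs_prune_command(verify_remote=verify_remote, dry_run=dry_run),
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

For example, `prune_lfs_objects("./dummy_run", dry_run=True)` would only list what could be removed, which seems like a safer first step than deleting right away.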

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 9 (9 by maintainers)

Top GitHub Comments

sgugger commented, Nov 5, 2021 (1 reaction)

Could you give #14294 a try?

LysandreJik commented, Oct 27, 2021 (1 reaction)

We can add it as a method and I’ll see if it makes sense to add it as a keyword argument for certain methods, too, like git_push or the commit context manager. Will work on a PR today or tomorrow.
