exp push: fails for >50MB commits
`dvc exp push` can fail since GitHub rejects commits >50MB in size. Perhaps use the DVC cache instead for such cases?
Part of https://github.com/iterative/cml/issues/560
/CC @pmrowla

For reference, the issue here was discussed in yesterday's meeting:
When we generate a checkpoint commit, anything marked as a pipeline dependency or output that is also `cache: false` will be forcefully tracked in Git. This is needed so that DVC can properly preserve/restore the state of the workspace when resuming checkpoint runs.

The problem with the large commits is most likely occurring when users have large, intermediate `cache: false` outputs/deps that they don't want tracked at all (by DVC or Git). But DVC cannot tell the difference between an output that is `cache: false` "because the user will track it with Git" and one that is `cache: false` "because it should not be tracked at all", so we end up incorrectly tracking these files in the checkpoint Git commits.

One thing to note here is that DVC will not do the forced tracking for files which are both `cache: false` and gitignored. This issue is potentially also happening because users don't bother gitignoring these intermediate files/dirs (since the user knows not to manually `git add` them). However, that doesn't work for an automated CI run like in CML.

If these intermediate files/dirs are properly gitignored, it would also stop DVC from generating these bloated checkpoint commits. I think we need to clearly document this behavior on both the DVC and CML sides.
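To illustrate with a minimal sketch (the stage name and paths here are hypothetical, not from this issue): a large intermediate output can be marked `cache: false` in `dvc.yaml`, and unless that path is also listed in `.gitignore`, DVC will force-add it to every checkpoint commit.

```yaml
# dvc.yaml (hypothetical example)
stages:
  preprocess:
    cmd: python preprocess.py data/raw data/intermediate
    deps:
      - data/raw
    outs:
      # Large intermediate artifact: not stored in the DVC cache.
      # Without a matching .gitignore entry, checkpoint commits will
      # force-track this path in Git and can exceed GitHub's size limit.
      - data/intermediate:
          cache: false
```

Adding `data/intermediate/` to `.gitignore` is what tells DVC to skip the forced Git tracking for that output when it creates checkpoint commits.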
@dberenbaum @casperdcl
On a related note, we may want to add an option to only keep the last N checkpoints. That could save disk space as well as sanity when running `exp show` on 10 billion checkpoints.