
exp push: fails for >50MB commits

See original GitHub issue

dvc exp push can fail since GitHub rejects commits >50MB in size. Perhaps use DVC cache instead for such cases?

Part of https://github.com/iterative/cml/issues/560

/CC @pmrowla

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

2 reactions
pmrowla commented, Jun 17, 2021

For reference, the issue here was discussed in yesterday’s meeting:

When we generate a checkpoint commit, anything marked as a pipeline dependency or output that is also cache: false will be forcefully tracked in Git. This is needed so that DVC can properly preserve/restore the state of the workspace when resuming checkpoint runs.

The problem with the large commits is most likely occurring when users have large, intermediate cache: false outputs/deps that they don’t want tracked at all (by DVC or Git). But DVC cannot tell the difference between an output that is cache: false “because the user will track it with Git” and one that is cache: false “because it should not be tracked at all”, so we end up incorrectly tracking these files in the checkpoint Git commits.


One thing to note here is that DVC will not do the forced tracking for files which are both cache: false and .gitignored. This issue is potentially also happening because users don’t bother gitignoring these intermediate files/dirs (since they know not to manually git add them). However, this doesn’t work for an automated CI run like in CML.

If these intermediate files/dirs are properly gitignored, it would also stop DVC from generating these bloated checkpoint commits. I think we need to clearly document this behavior on both the DVC and CML sides.
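As a sketch of the workaround described above (stage and file names here are hypothetical, for illustration only), an intermediate artifact can be marked cache: false in dvc.yaml and its directory listed in .gitignore, so that neither DVC nor Git tracks it and it never ends up in a checkpoint commit:

```yaml
# dvc.yaml (hypothetical stage name and paths)
stages:
  preprocess:
    cmd: python preprocess.py
    deps:
      - data/raw
    outs:
      # Intermediate output we don't want cached by DVC
      - tmp/intermediate.parquet:
          cache: false
```

Adding a matching line such as `tmp/` to `.gitignore` is the key step: with both in place, DVC skips the forced Git tracking for this output during checkpoint commits.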

@dberenbaum @casperdcl

1 reaction
casperdcl commented, Jun 17, 2021

“we track every iteration”

On a related note, we may want to add an option to only keep the last N checkpoints. This could save disk space as well as sanity when running exp show on 10 billion checkpoints.

Read more comments on GitHub >

Top Results From Across the Web

Unable to commit or push zipped files under 50 MB to GitHub ...
The maximum size is 100 MB , the recommended size is 50 MB . The error message: remote: warning: File RAW/private/test1.csv is 86.33...
Read more >
Troubleshooting Git - GitLab Docs
The value is specified in bytes, so in the above case the buffer size has been set to 50MB. The default is 1MB....
Read more >
exp push | Data Version Control - DVC
Open-source version control system for Data Science and Machine Learning projects. Git-like experience to organize your data, models, and experiments.
Read more >
You just committed a large file and can't push to GitHub
But oops, GitHub complains that you are trying to commit files larger than 50 Mb and even grinds to a halt if they...
Read more >
How to push large files to GitHub | by Ayuna Vogel | Medium
(I hope my learning experience helps other developers who bump into the same issue). Follow directions in your push commit error and go...
Read more >
