gs: potential cloud status regression between 2.10.2 and 2.28.0
Bug Report
dvc push: pushes the entire cache, even with no modified files
Description
Context
When using DVC, tracking whether data has been added, removed, or changed is powerful, and `dvc diff` reports exactly what changed, which is a useful feature. Unfortunately, that information does not seem to be used when pushing to the remote bucket: instead of pushing only the files in the diff, `dvc push` checks/pushes the entire cache at once. This is painful when large datasets are tracked, as a push can take several minutes instead of seconds.
It is especially problematic if you use hooks between Git and DVC, because every `git push` triggers a `dvc push` of all the files again.
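For context, such a hook might look like the following. This is a hypothetical `.git/hooks/pre-push` written for illustration (`dvc install` sets up similar hooks automatically):

```shell
#!/bin/sh
# .git/hooks/pre-push -- hypothetical hook: push DVC-tracked data before git push.
# With the behaviour described in this issue, it re-checks the whole cache
# on every single git push, however small the change.
exec dvc push
```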
Reproduce
1. `dvc add` 2 recently created files
2. `dvc commit`
3. `time dvc push` — the 2 files were pushed, it took 0m9.024s, and I saw `Querying cache in '..' |` while all my files were being pushed
4. `time dvc push` again — same output; the entire cache seems to be queried/pushed again. For some of our repos this could take a very large amount of time.
Expected
`dvc push` should only push what `dvc diff origin/main` returns, or the changes since the last commit with `dvc diff HEAD~1`.
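The expected behaviour could be sketched as a simple set difference: collect the hashes that `dvc diff` reports as added or modified, and upload only those that the remote is missing. This is a minimal illustration, not DVC's actual API; the dict layout and function names are hypothetical stand-ins.

```python
# Hypothetical sketch of diff-driven pushing. None of these names come from
# DVC's codebase; they only illustrate the expected behaviour.

def changed_hashes(diff):
    """Collect cache hashes from a dvc-diff-style result dict."""
    hashes = set()
    for change in ("added", "modified"):
        for entry in diff.get(change, []):
            hashes.add(entry["hash"])
    return hashes

def plan_push(diff, remote_hashes):
    """Upload only changed hashes that the remote does not already have."""
    return sorted(changed_hashes(diff) - set(remote_hashes))

diff = {
    "added": [{"path": "data/a.csv", "hash": "aa11"}],
    "modified": [{"path": "data/b.csv", "hash": "bb22"}],
    "deleted": [{"path": "data/old.csv", "hash": "cc33"}],
}
print(plan_push(diff, remote_hashes={"bb22"}))  # → ['aa11']
```

With this plan, a push after a two-file change would touch two objects at most, rather than re-querying the whole cache.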
Environment information
Output of `dvc doctor`:
```
$ dvc doctor
DVC version: 2.28.0 (pip)
---------------------------------
Platform: Python 3.10.6 on Linux-5.10.16.3-microsoft-standard-WSL2-x86_64-with-glibc2.31
Subprojects:
    dvc_data = 0.13.0
    dvc_objects = 0.5.0
    dvc_render = 0.0.11
    dvc_task = 0.1.2
    dvclive = 0.11.0
    scmrepo = 0.1.1
Supports:
    gs (gcsfs = 2022.7.1),
    http (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
    https (aiohttp = 3.8.1, aiohttp-retry = 2.8.3)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/sdb
Caches: local
Remotes: gs
Workspace directory: ext4 on /dev/sdb
Repo: dvc, git
```
Additional Information (if any):
Pushing with verbosity (`dvc push -v`):
```
2022-09-30 15:23:10,969 DEBUG: Preparing to transfer data from '/home/lorenzofurlan/repos/swirl-data/.dvc/cache' to 'descartes-swirl-dvc'
2022-09-30 15:23:10,969 DEBUG: Preparing to collect status from 'descartes-swirl-dvc'
2022-09-30 15:23:10,969 DEBUG: Collecting status from 'descartes-swirl-dvc'
2022-09-30 15:23:10,970 DEBUG: Querying 3 oids via object_exists
2022-09-30 15:23:11,175 DEBUG: Querying 0 oids via object_exists
2022-09-30 15:23:11,219 DEBUG: Estimated remote size: 4096 files
2022-09-30 15:23:11,219 DEBUG: Querying '11' oids via traverse
Everything is up to date.
2022-09-30 15:23:19,932 DEBUG: Analytics is enabled.
2022-09-30 15:23:19,957 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/tmp/tmpv8_66g8l']'
2022-09-30 15:23:19,958 DEBUG: Spawned '['daemon', '-q', 'analytics', '/tmp/tmpv8_66g8l']'
```
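The log suggests a status heuristic: sample a few object IDs with existence checks, extrapolate the remote's total size from one prefix of the keyspace, then decide whether per-object queries or a paged full listing ("traverse") is cheaper. The following is a rough sketch of that kind of heuristic, not DVC's actual implementation; the prefix count, page size, and cost model are assumptions.

```python
import math

# Hypothetical reconstruction of the size-estimation step. If 16 objects are
# found under one of the 256 possible two-hex-digit prefixes, the remote is
# estimated at 16 * 256 = 4096 objects (matching the log's "4096 files").
def estimate_remote_size(objects_under_prefix, num_prefixes=256):
    """Extrapolate total remote objects from one sampled hash prefix."""
    return objects_under_prefix * num_prefixes

# Hypothetical cost comparison: one request per oid for object_exists,
# versus one listing request per page (assumed 1000 objects) for traverse.
def choose_strategy(num_local_oids, estimated_remote_size, page_size=1000):
    """Pick the cheaper of per-oid existence checks and a paged listing."""
    list_requests = math.ceil(estimated_remote_size / page_size)
    return "traverse" if list_requests < num_local_oids else "object_exists"

print(estimate_remote_size(16))      # → 4096
print(choose_strategy(11, 4096))     # → traverse (≈5 list pages beat 11 lookups)
```

Under this model, checking 11 oids against an estimated 4096-object remote favours traversal, consistent with the `Querying '11' oids via traverse` line above; the perceived slowness would then come from how long the listing itself takes on the `gs` remote.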
Top GitHub Comments
@pmrowla Can we still investigate the regression or add a benchmark for it?
Ok very clear
Thanks @pmrowla! I think tracking directories will solve this problem. Although tracking each file was not slow before the regression in the Google Cloud library, it is not a good practice anyway.
Therefore you may close the issue.