gs: potential cloud status regression between 2.10.2 and 2.28.0

Bug Report

dvc push: pushing all the cache, even without modified files

Description

Context

DVC's tracking is powerful whenever data has been added, removed, or changed. The dvc diff command reports what has changed and is a useful feature. Unfortunately, dvc push does not seem to use it when pushing to the remote bucket: instead of pushing only the files in the diff, it tries to check or push the whole cache at once. This is painful when large datasets are tracked, because a push takes several minutes instead of seconds.

It is also problematic if you use hooks between Git and DVC, because every git push then triggers a dvc push of all the files again.

Reproduce

  1. dvc add 2 recently created files
  2. dvc commit
  3. time dvc push

The 2 files were pushed and it took 0m9.024s. I saw Querying cache in '..' | and all my files appeared to be pushed.

  4. time dvc push again

Same output: the whole cache seems to be pushed again. For some of our repos this could take a very long time.

Expected

dvc push should push only the files reported by dvc diff origin/main, or those changed since the last commit as reported by dvc diff HEAD~1.

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.28.0 (pip)
---------------------------------
Platform: Python 3.10.6 on Linux-5.10.16.3-microsoft-standard-WSL2-x86_64-with-glibc2.31
Subprojects:
        dvc_data = 0.13.0
        dvc_objects = 0.5.0
        dvc_render = 0.0.11
        dvc_task = 0.1.2
        dvclive = 0.11.0
        scmrepo = 0.1.1
Supports:
        gs (gcsfs = 2022.7.1),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.8.3)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/sdb
Caches: local
Remotes: gs
Workspace directory: ext4 on /dev/sdb
Repo: dvc, git

Additional Information (if any):

Pushing with verbosity

dvc push -v
2022-09-30 15:23:10,969 DEBUG: Preparing to transfer data from '/home/lorenzofurlan/repos/swirl-data/.dvc/cache' to 'descartes-swirl-dvc'
2022-09-30 15:23:10,969 DEBUG: Preparing to collect status from 'descartes-swirl-dvc'
2022-09-30 15:23:10,969 DEBUG: Collecting status from 'descartes-swirl-dvc'
2022-09-30 15:23:10,970 DEBUG: Querying 3 oids via object_exists
2022-09-30 15:23:11,175 DEBUG: Querying 0 oids via object_exists
2022-09-30 15:23:11,219 DEBUG: Estimated remote size: 4096 files
2022-09-30 15:23:11,219 DEBUG: Querying '11' oids via traverse
Everything is up to date.
2022-09-30 15:23:19,932 DEBUG: Analytics is enabled.
2022-09-30 15:23:19,957 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/tmp/tmpv8_66g8l']'
2022-09-30 15:23:19,958 DEBUG: Spawned '['daemon', '-q', 'analytics', '/tmp/tmpv8_66g8l']'
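
To see where the roughly nine seconds go, one can compare consecutive timestamps in the verbose output. The following is a minimal sketch, not part of DVC: it uses a subset of the log lines above, and parse and slowest_step are hypothetical helper names chosen for illustration. It assumes every line of interest has the "timestamp DEBUG: message" shape shown in the log.

```python
from datetime import datetime

# A subset of the timestamped lines from the verbose log above.
LOG = """\
2022-09-30 15:23:10,969 DEBUG: Collecting status from 'descartes-swirl-dvc'
2022-09-30 15:23:10,970 DEBUG: Querying 3 oids via object_exists
2022-09-30 15:23:11,175 DEBUG: Querying 0 oids via object_exists
2022-09-30 15:23:11,219 DEBUG: Estimated remote size: 4096 files
2022-09-30 15:23:11,219 DEBUG: Querying '11' oids via traverse
2022-09-30 15:23:19,932 DEBUG: Analytics is enabled.
"""


def parse(line):
    """Split one log line into (timestamp, message). Hypothetical helper."""
    stamp, _, message = line.partition(" DEBUG: ")
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f"), message


def slowest_step(log_text):
    """Return (seconds, message) for the line followed by the longest gap."""
    entries = [parse(line) for line in log_text.splitlines() if " DEBUG: " in line]
    gaps = [
        ((later[0] - earlier[0]).total_seconds(), earlier[1])
        for earlier, later in zip(entries, entries[1:])
    ]
    return max(gaps)


seconds, message = slowest_step(LOG)
print(f"{seconds:.3f}s elapsed after: {message}")
```

On these lines, the largest gap follows the "Querying '11' oids via traverse" message. The gap covers everything up to the next timestamped line, so it only suggests, rather than proves, that the traverse query against the remote accounts for most of the run time.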

Profiling file

profiling

Issue Analytics

  • State: open
  • Created a year ago
  • Reactions: 7
  • Comments: 7 (1 by maintainers)

Top GitHub Comments

1 reaction
dberenbaum commented, Oct 4, 2022

@pmrowla Can we still investigate the regression or add a benchmark for it?

1 reaction
mdeboc commented, Oct 3, 2022

OK, very clear.

Thanks @pmrowla! I think tracking directories will solve this problem. Although tracking each file individually was not slow before the regression in the Google Cloud library, it is not good practice.

Therefore, you may close the issue.
