UX issue with dvc pull - does not pull entire remote cache
Please provide information about your setup
Mac OS X with:
$ dvc --version
0.35.7
The issue is replicable using the Getting Started workspace.
When set up using these commands:
$ git clone https://github.com/iterative/example-get-started
$ cd example-get-started/
$ pip install -r requirements.txt
$ dvc pull
It appears the cache is incomplete. This can be observed by checking out different Git tags and attempting to use dvc checkout.
$ git tag
0-empty
1-initialize
2-remote
3-add-file
4-sources
5-preparation
6-featurization
7-train
8-evaluation
9-bigrams
baseline-experiment
bigrams-experiment
$ git checkout 7-train
Note: checking out '7-train'.
$ dvc status
featurize.dvc:
    changed outs:
        not in cache: data/features
train.dvc:
    changed deps:
        modified: data/features
    changed outs:
        not in cache: model.pkl
$ dvc checkout
ERROR: Failed to load dir cache '.dvc/cache/33/38d2c21bdb521cda0ba4add89e1cb0.dir' - [Errno 2] No such file or directory: '/Volumes/Extra/dvc/example-get-started/.dvc/cache/33/38d2c21bdb521cda0ba4add89e1cb0.dir'
Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
WARNING: Cache 'a66489653d1b6a8ba989799367b32c43' not found. File '{'scheme': 'local', 'path': '/Volumes/Extra/dvc/example-get-started/model.pkl'}' won't be created.
WARNING: Cache '3338d2c21bdb521cda0ba4add89e1cb0.dir' not found. File '{'scheme': 'local', 'path': '/Volumes/Extra/dvc/example-get-started/data/features'}' won't be created.
[##############################] 100% Checkout finished!
$ ls .dvc/cache
38 42 58 68 9d a3 aa dc
Notice that the directory .dvc/cache/33 is not there, just as the error message says.
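The relationship between the checksums in the warnings and the missing directory can be sketched in a few lines of Python. This is only an illustration, assuming the layout implied by the error message above (the first two characters of the checksum name the subdirectory, the rest names the file). The helper names `cache_path` and `missing_entries` are hypothetical and not part of DVC.

```python
from pathlib import Path


def cache_path(cache_dir, checksum):
    """Map a checksum to its expected cache file.

    Assumption (inferred from the error message in this report): the first
    two characters form the subdirectory, the remainder is the file name.
    """
    return Path(cache_dir) / checksum[:2] / checksum[2:]


def missing_entries(cache_dir, checksums):
    """Return the checksums whose cache file does not exist."""
    return [c for c in checksums if not cache_path(cache_dir, c).is_file()]


# Checksums copied from the WARNING lines in the checkout output.
checksums = [
    "a66489653d1b6a8ba989799367b32c43",
    "3338d2c21bdb521cda0ba4add89e1cb0.dir",
]

for checksum in missing_entries(".dvc/cache", checksums):
    print("missing:", cache_path(".dvc/cache", checksum))
```

Run from the root of the workspace, this lists the cache files that dvc checkout would not be able to find.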
If instead we initialize the workspace using dvc fetch -T or dvc fetch -aT, dvc checkout does not fail.
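For anyone scripting the workspace setup, the workaround can be wrapped as below. This is a minimal sketch, assuming `dvc` is on the PATH and using only the `-T` and `-a` flags mentioned in this report. The function names `fetch_command` and `fetch_then_checkout` are hypothetical helpers, not DVC APIs.

```python
import subprocess


def fetch_command(all_tags=True, all_branches=False):
    """Build the argument list for `dvc fetch` with the flags from this report."""
    cmd = ["dvc", "fetch"]
    if all_branches:
        cmd.append("-a")  # also fetch cache for all branches
    if all_tags:
        cmd.append("-T")  # also fetch cache for all tags
    return cmd


def fetch_then_checkout(all_branches=False):
    """Fetch cache for all tags (optionally all branches), then check out."""
    subprocess.run(fetch_command(all_branches=all_branches), check=True)
    subprocess.run(["dvc", "checkout"], check=True)
```

Calling `fetch_then_checkout()` after `git checkout 7-train` corresponds to running dvc fetch -T followed by dvc checkout.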
Issue Analytics
- State:
- Created 4 years ago
- Reactions: 2
- Comments: 12 (10 by maintainers)
It seems there is confusion about my thinking on this. First, I filed it because @shcheklein asked me to do so 😉
But to me the issue is not about dvc checkout but about the behavior of dvc pull with no options.
My expectation was that dvc pull with no options would pull down all data files (ditto with dvc fetch).
I was honestly surprised to see that it had not. My initial assumption was that the remote cache used by the example repository was somehow incomplete. Then I noticed @jorgeorpinel had noted the exact same issue earlier in the week.
That two of us fell into the same problem indicates to me that the UX is not correct. Back in the 1980s on Usenet we used the phrase “principle of least surprise”, which says that a program should produce the least surprise for the user. It’s not my call whether dvc pull with no options needs to change its behavior. I’m just saying that I was surprised by the current behavior.
By comparison, git pull with no options ensures that all commits are pulled from the remote repository.
Maybe some info message that looks something like “Pulling cache for the current workspace. To pull cache for the whole project see dvc pull -h.”? Would it help to make it more clear?