import-url: `--no-download` and `dvc pull`
Also, this is semi-related to #8164, but IMO using `dvc pull` here to download and complete/un-partial the imports gets confusing, because subsequent `dvc pull`s do not download the import from its original source location. So `dvc pull` only works as a substitute for `dvc update`-with-etag-verification that one time, and for all subsequent downloads the user has to use `dvc update` anyway.
After that first pull, DVC will try to pull the import from a regular DVC remote (and `dvc pull` will fail outright if your repo does not have a remote configured).
Basically, given this scenario, I don’t think it’s obvious to the user why the 2nd and 3rd `dvc pull`s fail, but the final `dvc update` succeeds:
$ dvc import-url "https://raw.githubusercontent.com/iterative/dvc/main/README.rst" --no-download
Importing 'https://raw.githubusercontent.com/iterative/dvc/main/README.rst' -> 'README.rst'
...
# pull to complete the import (succeeds, downloads from original https://... source URL)
$ dvc pull
Importing 'https://raw.githubusercontent.com/iterative/dvc/main/README.rst' -> 'README.rst'
1 file fetched
# remove the output and cache
$ rm -rf README.rst .dvc/cache
# pull with no remote configured (fails due to no remote configured)
$ dvc pull
ERROR: failed to pull data from the cloud - config file error: no remote specified. Create a default remote with
dvc remote add -d <remote name> <remote url>
# add a dummy remote
$ dvc remote add -d empty-remote ../empty
Setting 'empty-remote' as a default remote.
# pull with remote configured (fails because it tries to download cached object from dummy remote and not original import URL)
$ dvc pull
WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files:
name: README.rst, md5: ff649e3dae3038e81076e6e37dc7f57f
1 file failed
ERROR: failed to pull data from the cloud - Checkout failed for following targets:
/Users/pmrowla/git/scratch/tmp/README.rst
Is your cache up to date?
<https://error.dvc.org/missing-files>
# use update (succeeds, downloads latest version from original https://... source URL)
$ dvc update README.rst.dvc
Importing 'https://raw.githubusercontent.com/iterative/dvc/main/README.rst' -> 'README.rst'
_Originally posted by @pmrowla in https://github.com/iterative/dvc/issues/8024#issuecomment-1223868011_
Top GitHub Comments
This is the current behavior for anything imported with `import-url` (`pull` only pulls from DVC remotes); it’s not related to `--no-download`.

If the issue here is changing all non-repo imports to use the original source as a fallback, we can keep this open, but I’m not sure it’s a p1, given that this is how `import-url` has always worked and it has never really been an issue before.
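To make the distinction concrete, here is a minimal toy model of the behavior described above for an already-completed import. It is only a sketch: `ToyRepo`, `toy_pull`, and `toy_update` are hypothetical names invented for illustration and are not part of DVC's API, and the dictionaries merely stand in for the source URL, the local cache, and a DVC remote.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyRepo:
    """Hypothetical stand-in for a repo; not DVC's real data model."""
    source: dict = field(default_factory=dict)  # URL -> bytes at the original location
    cache: dict = field(default_factory=dict)   # md5 -> bytes in the local cache
    remote: Optional[dict] = None               # md5 -> bytes on the default remote, if any


def toy_pull(repo: ToyRepo, md5: str) -> bytes:
    """Model of `pull` for a completed import: local cache, then the remote only."""
    if md5 in repo.cache:
        return repo.cache[md5]
    if repo.remote is None:
        raise RuntimeError("no remote specified")
    if md5 not in repo.remote:
        raise FileNotFoundError(f"missing cache file: {md5}")
    repo.cache[md5] = repo.remote[md5]
    return repo.cache[md5]


def toy_update(repo: ToyRepo, url: str) -> str:
    """Model of `update`: always goes back to the original source URL."""
    data = repo.source[url]
    md5 = hashlib.md5(data).hexdigest()
    repo.cache[md5] = data
    return md5
```

In this model, clearing the cache and calling `toy_pull` fails with no remote, fails again with an empty remote, and only `toy_update` restores the data, which mirrors the sequence in the original report.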
This is not a prerequisite for #8164: for imports with cloud versioning, `pull` will always download files directly from the original source location (and not from DVC remotes).
I think for this issue the question is whether or not “pull from original source” (with etag verification for regular imports) should be the new default behavior for all imports, and it seems like the consensus was that it should be the default.
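As a rough illustration of what “pull from original source with etag verification” could mean, the sketch below accepts data from the source only when its current ETag matches the one recorded at import time. The function name, the `SourceChangedError` exception, and the injected `fetch` callable are all assumptions made for this example; none of them is taken from DVC.

```python
from typing import Callable, Tuple


class SourceChangedError(Exception):
    """Hypothetical error: the source's ETag no longer matches the recorded one."""


def pull_from_source(recorded_etag: str,
                     fetch: Callable[[], Tuple[str, bytes]]) -> bytes:
    """Fetch from the original location, keeping the data only if the ETag is unchanged.

    `fetch` is a hypothetical callable returning (current_etag, data).
    """
    current_etag, data = fetch()
    if current_etag != recorded_etag:
        # The source has moved on; this is the case where the import could be updated.
        raise SourceChangedError(
            f"recorded ETag {recorded_etag!r}, source now reports {current_etag!r}"
        )
    return data
```

Injecting `fetch` keeps the sketch independent of any particular transport, so the verification step can be read on its own.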
But with that change there’s also the follow-up question of what to do for non-cloud-versioned imports when the original source has changed (meaning it can be `update`’ed), but the user wants to get the old (unchanged) version of the import. Previously this would be done by just `dvc pull` (assuming the user had pushed it to a DVC remote).

I guess in this case you would have to do something like `dvc pull -r <specific remote>` to force pull from a DVC remote and not the original source location? (Which again gets confusing when `<specific remote>` is the configured default remote, since the user never had to use `-r` before this change.)
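The open question above can be phrased as a small decision function. This is only one reading of the comment, not a settled design: `choose_location` and its parameters are hypothetical, and `force_remote` stands in for passing `-r <specific remote>`.

```python
from typing import Optional


def choose_location(recorded_etag: str,
                    current_etag: Optional[str],
                    remote_has_object: bool,
                    force_remote: bool = False) -> str:
    """Return "source" or "remote" under a source-first default (an assumed policy)."""
    if force_remote:
        # Analogue of `-r <specific remote>`: skip the original source entirely.
        if remote_has_object:
            return "remote"
        raise LookupError("object is not on the requested remote")
    if current_etag == recorded_etag:
        # Source is unchanged, so the recorded version can be fetched from it.
        return "source"
    # Source changed (or is unreachable): the old version is only reachable
    # by explicitly asking for the remote.
    raise LookupError("source changed; pass force_remote to get the recorded version")
```

Under this policy, a user whose source has changed gets an error from the default call and must opt in to the remote, which is exactly the extra step the comment flags as potentially confusing.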