import-url: `--no-download` and `dvc pull`


Also, this is semi-related to #8164, but IMO using dvc pull here to download and complete (un-partial) the imports gets confusing, because subsequent dvc pulls do not download the import from its original source location. So dvc pull only works as a substitute for dvc update-with-etag-verification that one time, and for all subsequent downloads the user has to use dvc update anyway.

On those subsequent pulls, DVC will try to pull the import from a regular remote (and dvc pull will fail outright if your repo does not have a remote configured).

Basically, given this scenario, I don’t think it’s obvious to the user why the 2nd and 3rd dvc pulls fail, but the final dvc update succeeds:

$ dvc import-url "https://raw.githubusercontent.com/iterative/dvc/main/README.rst" --no-download
Importing 'https://raw.githubusercontent.com/iterative/dvc/main/README.rst' -> 'README.rst'
...

# pull to complete the import (succeeds, downloads from original https://... source URL)
$ dvc pull
Importing 'https://raw.githubusercontent.com/iterative/dvc/main/README.rst' -> 'README.rst'
1 file fetched

# remove the output and cache
$ rm -rf README.rst .dvc/cache

# pull with no remote configured (fails due to no remote configured)
$ dvc pull
ERROR: failed to pull data from the cloud - config file error: no remote specified. Create a default remote with
    dvc remote add -d <remote name> <remote url>

# add a dummy remote
$ dvc remote add -d empty-remote ../empty
Setting 'empty-remote' as a default remote.

# pull with remote configured (fails because it tries to download cached object from dummy remote and not original import URL)
$ dvc pull
WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files:
name: README.rst, md5: ff649e3dae3038e81076e6e37dc7f57f
1 file failed
ERROR: failed to pull data from the cloud - Checkout failed for following targets:
/Users/pmrowla/git/scratch/tmp/README.rst
Is your cache up to date?
<https://error.dvc.org/missing-files>

# use update (succeeds, downloads latest version from original https://... source URL)
$ dvc update README.rst.dvc
Importing 'https://raw.githubusercontent.com/iterative/dvc/main/README.rst' -> 'README.rst'

_Originally posted by @pmrowla in https://github.com/iterative/dvc/issues/8024#issuecomment-1223868011_
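The outcomes in the session above can be summarized as a small decision table. The following is only a toy model of the behavior described in this issue, not DVC's actual implementation; the names `Outcome` and `pull_outcome` are hypothetical, and the local cache is assumed to be empty in every case.

```python
from enum import Enum


class Outcome(Enum):
    FROM_SOURCE = "downloaded from the original import URL"
    FROM_REMOTE = "downloaded from the DVC remote"
    NO_REMOTE = "error: no remote specified"
    MISSING = "error: object missing from cache and remote"


def pull_outcome(partial_import: bool, has_default_remote: bool,
                 remote_has_object: bool) -> Outcome:
    """Toy model of `dvc pull` for an import-url output with an empty cache.

    Hypothetical helper: it mirrors the behavior described in the issue,
    not DVC's internal code paths.
    """
    if partial_import:
        # Import created with --no-download and not yet completed:
        # this one pull goes to the original source URL.
        return Outcome.FROM_SOURCE
    if not has_default_remote:
        # Completed imports are only looked up on DVC remotes.
        return Outcome.NO_REMOTE
    if remote_has_object:
        return Outcome.FROM_REMOTE
    return Outcome.MISSING


# The three `dvc pull` calls from the session, in order.
steps = [
    pull_outcome(partial_import=True, has_default_remote=False, remote_has_object=False),
    pull_outcome(partial_import=False, has_default_remote=False, remote_has_object=False),
    pull_outcome(partial_import=False, has_default_remote=True, remote_has_object=False),
]
for step in steps:
    print(step.value)
```

Only the first call reaches the source URL; once the import is complete, the source is never consulted again by pull, which is why the later calls fail.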

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments:7 (4 by maintainers)

Top GitHub Comments

1 reaction
pmrowla commented, Aug 26, 2022

> 🤔 In the scenario you showed above, I wouldn’t expect dvc pull to fail if this is current behavior. Why not always fall back to trying the source location?

This is the current behavior for anything imported with import-url (pull only pulls from DVC remotes); it’s not related to --no-download.

If the issue here is changing all non-repo imports to use the original source as a fallback, we can keep this open, but I’m not sure it’s a p1, given that this is how import-url has always worked and it has never really been an issue before.
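Until something like a source fallback exists, a user-side wrapper can approximate it. This is a minimal sketch, assuming the `dvc` CLI is on `PATH`; `pull_or_update` and `run_cli` are hypothetical helper names, not part of DVC. Note the caveat from the discussion: `dvc update` downloads the latest version from the source, which may differ from the version recorded in the .dvc file.

```python
import subprocess
from typing import Callable, Sequence

# A runner takes a command line and returns its exit code.
Runner = Callable[[Sequence[str]], int]


def run_cli(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code without raising on failure."""
    return subprocess.run(list(cmd), check=False).returncode


def pull_or_update(dvc_file: str, run: Runner = run_cli) -> str:
    """Try `dvc pull` first; if it fails, fall back to `dvc update`.

    Hypothetical workaround: the fallback fetches the *latest* source
    version, not necessarily the one recorded in the .dvc file.
    """
    if run(["dvc", "pull", dvc_file]) == 0:
        return "pull"
    if run(["dvc", "update", dvc_file]) == 0:
        return "update"
    raise RuntimeError(f"both pull and update failed for {dvc_file}")
```

Called as `pull_or_update("README.rst.dvc")`, it returns which command succeeded. The `run` parameter exists so the logic can be exercised with a fake runner, without invoking DVC.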

1 reaction
pmrowla commented, Aug 25, 2022

This is not a pre-requisite for #8164 - for imports with cloud versioning pull will always download files directly from the original source location (and not DVC remotes).

I think for this issue the question is whether or not “pull from original source” (with etag verification for regular imports) should be the new default behavior for all imports, and it seems like the consensus was that it should be the default.

But with that change there’s also the follow-up question of what to do for non-cloud-versioned imports when the original source has changed (meaning the import can be refreshed with dvc update), but the user wants to get the old (unchanged) version of the import. Previously this would be done with just dvc pull (assuming the user had pushed it to a DVC remote).

I guess in this case you would have to do something like dvc pull -r <specific remote> to force pull from a DVC remote and not the original source location? (Which again gets confusing when <specific remote> is the configured default remote, since the user never had to use -r before this change)
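To make the open question concrete, here is a rough sketch of the "pull from original source with etag verification" idea, with an explicit remote override standing in for `dvc pull -r`. Everything in it is hypothetical (`ImportState`, `choose_source`, and the returned labels); it illustrates the trade-off discussed above, not how DVC resolves imports.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ImportState:
    recorded_etag: str          # etag stored in the .dvc file at import time
    source_etag: Optional[str]  # current etag at the source, None if unreachable
    remote_has_object: bool     # recorded version was pushed to a DVC remote


def choose_source(state: ImportState, force_remote: bool = False) -> str:
    """Decide where a pull would fetch an import from (hypothetical model)."""
    if force_remote:
        # Models `dvc pull -r <remote>`: skip the source entirely.
        return "remote" if state.remote_has_object else "fail"
    if state.source_etag == state.recorded_etag:
        # Source is unchanged, so it still holds the recorded version.
        return "source"
    # Source changed or is unreachable: the recorded version is no longer
    # available there, so the user has to pick `dvc update` (new version)
    # or an explicit remote pull (old version).
    return "source-changed"


unchanged = ImportState("abc", "abc", remote_has_object=False)
changed = ImportState("abc", "def", remote_has_object=True)
print(choose_source(unchanged))                   # source
print(choose_source(changed))                     # source-changed
print(choose_source(changed, force_remote=True))  # remote
```

The "source-changed" branch is where the confusion described above shows up: a plain pull can no longer return the old version, even though the default remote has it.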
