Pull imports from source
Files that have been added via dvc import-url can be individually downloaded from their source location via dvc update <target>, but dvc pull looks for the files on the remote.
This request is for dvc pull to include an option to download these files from their source. Whether this should be the default behaviour is open to discussion, but it would be helpful to at least have this option; let’s call it --from-source for now.
Use case: In scientific research we often use previously published datasets as references (for comparison against new data, for example). These datasets are hosted on well-funded, stable file servers with stable URLs for each file. It would be helpful to be able to include such data in a DVC repo without having to include it in the DVC remote for that repo.
For less stable URLs (meaning URLs where the target data is subject to change), I can see the value in including the data in the remote, as this allows version control and fetching of previous versions.
One way to deal with these conflicting use cases is to include an option for marking imported URLs as “fixed” (non-variable). This property could be recorded in the .dvc file so that DVC knows a) not to push the data to the remote and b) to pull the data from source as part of a dvc pull. Vanilla dvc import-url operations would continue to function as they do currently.
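The "fixed" marker proposed above can be sketched as a small decision rule. This is only an illustration of the proposal, not DVC code: the `ImportedUrl` record, its `fixed` field, and both helper functions are hypothetical names, and no such field is assumed to exist in current .dvc files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImportedUrl:
    """Hypothetical stand-in for the record of a URL import in a .dvc file."""
    path: str
    source_url: str
    fixed: bool = False  # hypothetical flag from the proposal, not a real DVC field


def should_push(entry: ImportedUrl) -> bool:
    # (a) data behind a "fixed" URL is not copied to the remote
    return not entry.fixed


def pull_origin(entry: ImportedUrl) -> str:
    # (b) "fixed" imports are fetched from their source; others from the remote
    return "source" if entry.fixed else "remote"
```

With the default `fixed=False`, both helpers reproduce the current behaviour (push to the remote, pull from the remote), which matches the requirement that vanilla imports stay unchanged.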
Top GitHub Comments
Perhaps the cleanest and simplest way to incorporate this kind of functionality is something like the following:
dvc update --all
(download all files that were added with dvc import-url from their source into the workspace)
Then anyone wanting to replicate a dataset can grab all the necessary data with just
Alternatively, dvc pull could get an additional option as follows:
dvc pull --from-source
(fall back to downloading from source iff data not available on remote, or if no remote set)
These alternatives are not exclusive.
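The fallback rule suggested for --from-source (use the remote when it has the data, otherwise go to the source, including when no remote is set) can be written out as a short sketch. The function name, its parameters, and the use of a plain set to model the remote's contents are assumptions made for illustration; none of this is DVC's actual implementation.

```python
from typing import Optional, Set


def choose_origin(
    path: str,
    source_url: Optional[str],
    remote_paths: Optional[Set[str]],
    from_source: bool = False,
) -> str:
    """Decide where a file would be fetched from: "remote" or "source".

    remote_paths models what the remote holds; None means no remote is set.
    All names here are hypothetical.
    """
    # Prefer the remote whenever it actually has the data.
    if remote_paths is not None and path in remote_paths:
        return "remote"
    # Otherwise fall back to the original URL, but only if asked to.
    if from_source and source_url:
        return "source"
    raise FileNotFoundError(f"{path}: not on remote and no source fallback available")
```

Keeping `from_source` off by default leaves today's behaviour intact; making the fallback the default would only require flipping that one parameter.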
But either way, suppose we have project A that uses dvc import-url to import data.csv from https://source.com. Then suppose that project B uses dvc import to import data.csv from project A into project B. Then dvc pull --from-source or dvc update --all (or both) should download data.csv from https://source.com into the workspace for project B.
In #8172 this is described as an edge case. But I suggest that this could be a fairly central use case in scientific publishing, where invariant data is lodged in public repositories with stable URLs. Publishing raw data in these public repositories is often a condition of grant funding. So although we are likely to use DVC remotes while we initially compile and analyse our data, we will inevitably upload most if not all of the raw data to these public repositories. But if we continue to organise our datasets with DVC (just with data hosted in public repositories rather than DVC remotes) then future projects can build on already published datasets with a simple dvc import, regardless of where the data is actually hosted.
Makes sense @johnyaku! For both import and import-url, an option to determine whether to push a copy of the data may be needed, as proposed in https://github.com/iterative/dvc/issues/4527.
Regardless, I see no reason DVC should not try to fall back to the original source if the data is not in the remote.