Pull imports from source

Files that have been added via dvc import-url can be individually downloaded from their source location via dvc update <target>, but dvc pull only looks for the files on the remote.

This request is for dvc pull to include an option to download these files from their source.

Whether this should be the default behaviour is open to discussion, but it would be helpful to at least have this option. Let’s call it --from-source for now.

Use case: In scientific research we often use previously published datasets as references (for comparison against new data, for example). These datasets are hosted on well-funded, stable file servers with stable URLs for each file. It would be helpful to be able to include such data in a DVC repo without having to include it in the DVC remote for that repo.

For less stable URLs (meaning URLs where the target data is subject to change), I can see the value in including the data in the remote, as this allows version control and fetching of previous versions.

One way to deal with these conflicting use cases is to include an option for marking imported URLs as “fixed” (non-variable). This property could be recorded in the .dvc file so that DVC knows a) not to push the data to the remote and b) to pull the data from source as part of a dvc pull. Vanilla dvc import-url operations would continue to work as they do now.
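The “fixed” marker proposed above could be sketched roughly as follows. This is only an illustration of the intended behaviour, not DVC code: the ImportedUrl class, its fixed field, and the two helper functions are hypothetical names invented for this example, and nothing here reflects the real .dvc file schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImportedUrl:
    """Illustrative stand-in for an import-url entry (not DVC's real schema)."""

    path: str
    source_url: str
    fixed: bool = False  # hypothetical marker proposed in this issue


def should_push(entry: ImportedUrl) -> bool:
    # (a) Data behind a fixed URL is not copied to the remote.
    return not entry.fixed


def pull_location(entry: ImportedUrl) -> str:
    # (b) Fixed imports are pulled from source; all others from the remote.
    return "source" if entry.fixed else "remote"


# Example entries; both URLs are made up for illustration.
reference = ImportedUrl("ref.csv", "https://source.com/ref.csv", fixed=True)
volatile = ImportedUrl("live.csv", "https://source.com/live.csv")
```

With the default fixed=False, an entry behaves like a vanilla import: it is pushed to the remote and pulled from it.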

Issue Analytics

Created a year ago
Reactions: 1
Comments: 7 (3 by maintainers)

Top GitHub Comments

2 reactions

johnyaku commented, Sep 11, 2022

Perhaps the cleanest and simplest way to incorporate this kind of functionality is something like the following:

dvc update --all (download into the workspace all files that were imported from source with dvc import-url)

Then anyone wanting to replicate a dataset can grab all the necessary data with just

git clone <repo-url>
cd <repo>
dvc pull
dvc update --all

Alternatively, dvc pull could get an additional option as follows:

dvc pull --from-source (fall back to downloading from source iff data not available on remote, or if no remote set)

These alternatives are not exclusive.
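The fallback rule suggested for dvc pull --from-source (use the remote when it has the data, otherwise go to the source) can be sketched as below. This is a toy model under stated assumptions, not DVC's implementation: the remote is represented by a plain mapping, and pull_with_fallback, source_urls and download are hypothetical names introduced for this example.

```python
from typing import Callable, Mapping, Optional, Tuple


def pull_with_fallback(
    path: str,
    remote: Optional[Mapping[str, bytes]],
    source_urls: Mapping[str, str],
    download: Callable[[str], bytes],
) -> Tuple[bytes, str]:
    """Return (data, origin), preferring the remote over the source URL."""
    # Normal case: the remote is configured and holds the file.
    if remote is not None and path in remote:
        return remote[path], "remote"

    # Fallback: no remote set, or the file is missing from it.
    url = source_urls.get(path)
    if url is None:
        raise FileNotFoundError(f"{path}: not on remote and no source URL recorded")
    return download(url), "source"


def fake_download(url: str) -> bytes:
    # Stand-in for a real HTTP fetch so the sketch runs offline.
    return f"contents of {url}".encode()


urls = {"data.csv": "https://source.com/data.csv"}  # assumed full URL
from_remote = pull_with_fallback("data.csv", {"data.csv": b"cached"}, urls, fake_download)
from_source = pull_with_fallback("data.csv", None, urls, fake_download)
```

Passing remote=None models the “no remote set” case, and an empty mapping models a remote that simply lacks the file.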

But either way, suppose we have project A that uses dvc import-url to import data.csv from https://source.com. Then suppose that project B uses dvc import to import data.csv from project A into project B. Then dvc pull --from-source or dvc update --all (or both) should download data.csv from https://source.com into the workspace for project B.
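To make the project A / project B scenario concrete, here is a small sketch of how a chain of imports might be followed back to the original URL. It assumes a simplified model in which every imported file records either a source URL (as with dvc import-url) or a pointer to a file in another project (as with dvc import). The imports table, the resolve_source helper, and the full file URL are all hypothetical.

```python
from typing import Dict, Tuple, Union

ProjectFile = Tuple[str, str]  # (project, path)
Origin = Union[str, ProjectFile]  # URL string, or a file in another project


def resolve_source(imports: Dict[ProjectFile, Origin], project: str, path: str) -> str:
    """Follow import links until an import-url style origin (a URL) is found."""
    seen = set()
    key: ProjectFile = (project, path)
    while True:
        if key in seen:
            raise ValueError(f"import cycle detected at {key}")
        seen.add(key)
        origin = imports[key]
        if isinstance(origin, str):
            return origin  # reached the original source URL
        key = origin  # hop to the upstream project and keep going


imports: Dict[ProjectFile, Origin] = {
    # Project A used import-url (the full file URL is assumed here).
    ("A", "data.csv"): "https://source.com/data.csv",
    # Project B used import to take data.csv from project A.
    ("B", "data.csv"): ("A", "data.csv"),
}
```

Under this model, resolving data.csv for project B yields the same URL that project A originally imported from, which is the behaviour requested above.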

In #8172 this is described as an edge case. But I suggest that this could be a fairly central use case in scientific publishing, where invariant data is lodged in public repositories with stable URLs. Publishing raw data in these public repositories is often a condition of grant funding. So although we are likely to use DVC remotes while we initially compile and analyse our data, we will inevitably upload most if not all of the raw data to these public repositories. But if we continue to organise our datasets with DVC (just with data hosted in public repositories rather than DVC remotes), then future projects can build on already published datasets with a simple dvc import, regardless of where the data is actually hosted.

1 reaction

dberenbaum commented, Sep 12, 2022

Makes sense @johnyaku! For both import and import-url, an option to determine whether to push a copy of the data may be needed, as proposed in https://github.com/iterative/dvc/issues/4527.

Regardless, I see no reason DVC should not try to fall back to the original source if the data is not in the remote.
