License confirmation for `dvc get` and `dvc import`
There are various licenses for downloadable datasets. `dvc get` and `dvc import` can check whether there is a `LICENSE` file within a tracked directory, print this license, and ask the user for confirmation before download. This allows us to conform with the attribution and copyright requirements in licenses like MIT or Apache.
For a Git repository directory in the form
.
├── README.md
├── fashion-mnist
│ ├── LICENSE
│ ├── raw
│ │ ├── t10k-images-idx3-ubyte.gz
│ │ ├── t10k-labels-idx1-ubyte.gz
│ │ ├── train-images-idx3-ubyte.gz
│ │ └── train-labels-idx1-ubyte.gz
│ └── raw.dvc
we use `dvc get https://github.com/iterative/dataset-registry/fashion-mnist/raw.dvc` to get the dataset.
At this point, instead of downloading directly, DVC can check whether there is a `LICENSE` file in the directory `fashion-mnist/` and present it to the user for confirmation. The same applies to `dvc import`.
I think this should be the default behavior, and an option like `--skip-license-confirmation` is also needed for scripts.
This gives us a basis for offering public datasets with different license restrictions in a single dataset registry.
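To make the proposal concrete, here is a minimal sketch of the check described above, written as standalone Python rather than against DVC's internals. The function names `find_license` and `confirm_license`, the `skip_confirmation` parameter (standing in for the proposed `--skip-license-confirmation` option), and the list of candidate file names are all assumptions for illustration; none of them exist in DVC today.

```python
from pathlib import Path
from typing import Callable, Optional

# Assumed candidate file names; the issue itself only mentions "LICENSE".
LICENSE_NAMES = ("LICENSE", "LICENSE.txt", "LICENSE.md")


def find_license(target: Path) -> Optional[Path]:
    """Return the license file in the directory that contains the target.

    For a target such as fashion-mnist/raw.dvc this looks in fashion-mnist/.
    """
    directory = target.parent
    for name in LICENSE_NAMES:
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None


def confirm_license(
    target: Path,
    skip_confirmation: bool = False,
    ask: Callable[[str], str] = input,
) -> bool:
    """Return True if the download may proceed.

    skip_confirmation is a hypothetical stand-in for the proposed
    --skip-license-confirmation option, intended for scripts.
    """
    if skip_confirmation:
        return True
    license_file = find_license(target)
    if license_file is None:
        # No license next to the target: nothing to confirm.
        return True
    # Show the license text, then ask; anything other than y/yes declines.
    print(license_file.read_text(encoding="utf-8"))
    answer = ask("Accept this license and continue? [y/N] ")
    return answer.strip().lower() in ("y", "yes")
```

With the layout above, `confirm_license(Path("fashion-mnist/raw.dvc"))` would print `fashion-mnist/LICENSE` and wait for an answer, while passing `skip_confirmation=True` would proceed without prompting. Defaulting to "no" on an empty answer is a design choice of this sketch, not something the issue specifies.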
Issue Analytics
- Created 2 years ago
- Reactions:2
- Comments:8 (2 by maintainers)
@iesahin I think @pmrowla is suggesting that this could be built on top of DVC but be considered a separate product.
It seems like this issue is more of a feature request for a public dataset registry, with license confirmation being one of the requirements of that feature request. Would you agree @iesahin? Am I missing anything?
DVC doesn’t host or distribute anything though, it’s just tooling. I guess the line is blurred a bit when it comes to Studio, but it still seems to me like anything on the licensing/attribution side of things would be a Studio issue, and not a core DVC issue (similar to the difference between GitHub/GitLab and Git).