`dvc import` compatible with GitHub App Token
I haven’t seen any proposal of this kind in the issues and, based on my use case, it could solve a number of problems.
Scenario:
- you have a Data Registry (as git repo + cloud storage, e.g., AWS S3);
- you have an Experiment Repository containing the code that runs experiments (and the experiments use data from the Data Registry);
- you wrap all of this with CML and use a GitHub App with Access Tokens.
Problem:
- suppose you use `dvc import` to obtain `some_data` from the Data Registry (call it `github.com/username/DataRegistry`); it will be recorded in `dvc.lock` as:

      deps:
      - path: some_data
        repo:
          url: git@github.com:username/DataRegistry.git
          rev_lock: af6a1feb542dc05b4d3e9c80deb50e6596876e5f

- now the problem occurs: CML runs this pipeline on an instance, and when it tries to get the data from the Data Registry remote it fails, because it cannot clone the Data Registry repository (to do so, it would need to use the generated app token).
Proposition:
- it would be nice if `dvc import` (or actually `dvc pull`?) checked for a `DATA_REGISTRY_TOKEN` env variable and updated the url “on the fly” when pulling data from the remote.
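To make the proposition concrete, here is a minimal sketch of the "on the fly" rewrite. It is not DVC code: the helper name `with_token` and its parameters are hypothetical, and it assumes the token is a GitHub App installation token, which GitHub accepts over HTTPS with the user name `x-access-token`.

```python
import os
import re

# Matches the SSH form recorded in dvc.lock (git@github.com:owner/repo.git)
# as well as a plain HTTPS GitHub URL.
_GITHUB_URL = re.compile(r"^(?:git@github\.com:|https://github\.com/)(?P<path>.+)$")


def with_token(url, env_var="DATA_REGISTRY_TOKEN", environ=None):
    """Return an HTTPS clone URL carrying the token, or the URL unchanged.

    Hypothetical helper for illustration; DVC does not expose this function.
    """
    environ = os.environ if environ is None else environ
    token = environ.get(env_var)
    match = _GITHUB_URL.match(url)
    if not token or match is None:
        # No token set, or not a GitHub URL: leave the recorded URL alone.
        return url
    return f"https://x-access-token:{token}@github.com/{match.group('path')}"
```

Under this approach the `url` stored in `dvc.lock` would stay as it is; only the URL used for the clone at pull time would change, so a token regenerated on every run never ends up in the repository.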
Disclaimer: I intended to write this some months ago, when the desired behaviour was not in place. I took a quick look, but did not find any mention of it.
Thanks for your effort and please ask any questions in case you need clarification!
Issue Analytics
- State:
- Created a year ago
- Reactions:1
- Comments:6 (5 by maintainers)
Top GitHub Comments
@dberenbaum
If you’re thinking of support for git credential helpers, one way this could work is the following:
- `git credential-cache`, if cli git is available
For example:
This looks a bit clunky to me, although this would work starting with the next dvc release (see https://github.com/iterative/scmrepo/pull/138).
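As a separate illustration of the credential-helper route, the sketch below shows how a custom helper could hand git a freshly generated token instead of caching one. It follows the key=value protocol described in `man gitcredentials`; the function names and the `DATA_REGISTRY_TOKEN` variable are assumptions for this example, and whether DVC consults such a helper depends on the scmrepo change linked above.

```python
def read_request(stream):
    """Parse the key=value lines git writes to a credential helper's stdin."""
    request = {}
    for line in stream:
        line = line.rstrip("\n")
        if not line:
            break  # a blank line ends the request
        key, _, value = line.partition("=")
        request[key] = value
    return request


def respond(operation, request, token):
    """Return the lines the helper should print for one invocation.

    Hypothetical helper logic: only "get" is answered, and only for
    github.com; "store" and "erase" are ignored because the token is
    regenerated on every pipeline run.
    """
    if operation != "get" or not token:
        return []
    if request.get("host") != "github.com":
        return []
    return ["username=x-access-token", f"password={token}"]
```

A small script would call `respond(sys.argv[1], read_request(sys.stdin), os.environ.get("DATA_REGISTRY_TOKEN"))`, print each returned line, and be registered through git's `credential.helper` setting. Credential helpers are consulted for HTTP(S) remotes, so the SSH-style `url` from the issue would still have to be an HTTPS URL for this to apply.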
An alternative would be setting up credentials sections in the dvc config that can be looked up when performing `import` or `import-url`, something like:
It might also be worth providing facilities to write values to the config, something like
Cons with this approach:
- ... `man gitcredentials`)
- ... `--local` config)

@casperdcl This one
Although I believe the PAT / App Token should not be stored, since (in the case of the App Token) it will be re-generated every time the pipeline is run (e.g., in a GitHub action). One idea for a solution could be to have an `--import-token` option that would work with other dvc commands (e.g., `dvc repro`), which, when passed, would make sure that anything obtained with `dvc import` uses the passed token to authenticate when checking out the repo under the `url` key.