`dvc import` compatible with GitHub App Token


I haven’t seen any proposal of this kind in the issues and - based on my use case - it could solve a number of problems.

Scenario:

  • you have a Data Registry (as git repo + cloud storage, e.g., AWS S3);
  • you have an Experiment Repository containing the code that runs experiments (and the experiments use data from the Data Registry);
  • you wrap all of this with CML and use a GitHub App with Access Tokens.

Problem:

  • suppose you use dvc import to obtain some_data from the Data Registry (call it: github.com/username/DataRegistry)
  • it will be recorded in dvc.lock as
     deps:
       - path: some_data
         repo:
           url: git@github.com:username/DataRegistry.git
           rev_lock: af6a1feb542dc05b4d3e9c80deb50e6596876e5f
    
  • now the problem occurs: CML runs this pipeline on an instance, and when it tries to get the data from the Data Registry remote it fails, because it cannot clone the Data Registry repository (to do so, it would need to use the generated app token).

Proposition:

  • it would be nice if dvc import (or actually dvc pull?) checked for a DATA_REGISTRY_TOKEN env variable and updated the url "on the fly" when pulling data from the remote.
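
The proposed "on the fly" URL update could be sketched roughly as below. This is only an illustration of the idea, not how DVC behaves: the function names with_token and resolve_repo_url are hypothetical, and the x-access-token username is an assumption based on GitHub's convention for cloning over HTTPS with an app installation token.

```python
import os
import re
from typing import Mapping, Optional

# Matches scp-style SSH URLs such as git@github.com:username/DataRegistry.git
_SSH_URL = re.compile(r"^git@(?P<host>[^:]+):(?P<path>.+)$")
# Matches HTTPS URLs, with or without credentials already embedded
_HTTPS_URL = re.compile(r"^https://(?:[^@/]+@)?(?P<host>[^/]+)/(?P<path>.+)$")


def with_token(url: str, token: Optional[str]) -> str:
    """Return an HTTPS clone URL carrying the token (hypothetical helper).

    The URL is returned unchanged when no token is given or when the URL
    is in a form this sketch does not recognise.
    """
    if not token:
        return url
    match = _SSH_URL.match(url) or _HTTPS_URL.match(url)
    if match is None:
        return url
    # Assumption: GitHub accepts "x-access-token" as the username for app tokens.
    return f"https://x-access-token:{token}@{match['host']}/{match['path']}"


def resolve_repo_url(url: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Rewrite the recorded repo url using DATA_REGISTRY_TOKEN, if it is set."""
    env = os.environ if env is None else env
    return with_token(url, env.get("DATA_REGISTRY_TOKEN"))


if __name__ == "__main__":
    recorded = "git@github.com:username/DataRegistry.git"
    print(resolve_repo_url(recorded, {"DATA_REGISTRY_TOKEN": "example-token"}))
```

Because the rewritten URL exists only in memory, the url recorded in dvc.lock stays untouched and the token is never written to disk.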

Disclaimer: I intended to write this up some months ago, and at that time the desired behaviour was not in place. I took a quick look now, but did not find any mention of it.

Thanks for your effort and please ask any questions in case you need clarification!

Issue Analytics

  • State: open
  • Created a year ago
  • Reactions: 1
  • Comments: 6 (5 by maintainers)

Top GitHub Comments

2 reactions
dtrifiro commented, Sep 12, 2022

@dberenbaum

If you’re thinking of support for git credential helpers, one way this could work is the following:

  1. Set up a credential helper (could even be git credential-cache, if cli git is available).
  2. Store the credential in the helper.
  3. Actually perform the operation.

For example:

printf "[credential]\n    helper=cache" >> ~/.gitconfig 
printf "url=https://github.com\nusername=username\npassword=password" | git credential-cache store
dvc import https://github.com//[...]

This looks a bit clunky to me, although this would work starting with the next dvc release (see https://github.com/iterative/scmrepo/pull/138).

An alternative would be setting up credentials sections in the dvc config that can be looked up when performing import or import-url, something like:

['credential "https://github.com"']
username = username
password = password

It might also be worth providing facilities to write values to the config, something like

dvc config set credential.https://github.com username username       
dvc config set credential.https://github.com password password       

Cons with this approach:

  • configuring git credentials in the dvc config seems a bit out of place
  • possibly duplicating functionality provided by git (see man gitcredentials)
  • storing passwords in config files (although this could be similar to storing remote credentials in --local config)
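
To make the config-based alternative more concrete, here is a minimal sketch of how such credential sections might be looked up for a given repo URL. It is an assumption-laden illustration, not DVC code: find_credentials is a hypothetical name, and the sections are passed in as an already-parsed dict keyed by URL prefix rather than read from a real DVC config file.

```python
from typing import Dict, Optional, Tuple
from urllib.parse import urlsplit

Credentials = Tuple[str, str]  # (username, password)


def find_credentials(
    url: str, sections: Dict[str, Dict[str, str]]
) -> Optional[Credentials]:
    """Return credentials from the most specific matching section, if any.

    A section matches when its scheme and host equal those of the URL and
    its path (if present) is a prefix of the URL path on a "/" boundary.
    """
    target = urlsplit(url)
    best: Optional[Credentials] = None
    best_len = -1
    for prefix, values in sections.items():
        candidate = urlsplit(prefix)
        if (candidate.scheme, candidate.netloc) != (target.scheme, target.netloc):
            continue
        path = candidate.path.rstrip("/")
        if path and not (target.path == path or target.path.startswith(path + "/")):
            continue
        # Prefer the longest (most specific) matching path prefix.
        if len(path) > best_len:
            best = (values["username"], values["password"])
            best_len = len(path)
    return best


if __name__ == "__main__":
    # Mirrors the ['credential "https://github.com"'] section shown above.
    config = {"https://github.com": {"username": "username", "password": "password"}}
    print(find_credentials("https://github.com/username/DataRegistry.git", config))
```

Matching on scheme and host first, then on the longest path prefix, follows the general spirit of how git scopes credential settings to URLs, although the exact matching rules here are simplified.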

1 reaction
mikolajpabiszczak commented, Sep 6, 2022

@casperdcl This one

> Or do you mean DVC’s deps.*.repo.url is a private repo that needs a PAT for pull access? In which case I guess DVC could support a REPO_TOKEN env var for authentication the same way CML does. Plus it would need a CLI API for it - presumably dvc import --token=... though not sure where it should store said token. Presumably not in dvc.yaml but in the system config? Would mean treating the repo URL like a data remote URL (i.e. give it a shortname, save creds in user config dirs, etc.)

Although I believe the PAT / App Token should not be stored, since (in the case of the App Token) it will be regenerated every time the pipeline is run (e.g., in a GitHub Action). One idea for a solution could be to have an --import-token option that would work with other dvc commands (e.g., dvc repro) and which, when passed, would make sure that anything obtained with dvc import uses the passed token to authenticate when checking out the repo under the url key.
