Git tracked bare DVC repo (only tracking .dvc file, but don't checkout real file)

Background

We generate a large number of log files every day, and we want to use DVC to track them.

Current Method

To track our daily logs with DVC today, we have to:

  1. Create a git repo and run dvc init
  2. Copy the log files into the git repo
  3. dvc add those files and dvc commit to generate the .dvc files, then dvc push to transfer the files to the remote
  4. git commit the generated .dvc files, and git tag the commit with a timestamp (or version)
  5. To save local space, remove all the log files, leaving only the .dvc files

When new daily logs arrive, we need to repeat steps 2-5 to track them.

When someone needs to analyse the log files, they need to: clone the git repo, git checkout a tagged version, and run dvc pull to download the files locally (dvc checkout alone only restores files that are already in the local cache).
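
The current workflow can be sketched as a small Python helper that builds the command sequence for the data owner (steps 3-4) and for someone analysing the logs. This is only an illustration under assumptions: the function names, the repo path and the tag format are hypothetical, a default DVC remote is assumed to be configured already, and dvc commit is left out because dvc add already writes the .dvc file. The download uses dvc pull, since dvc checkout only restores from the local cache.

```python
import subprocess

Step = tuple[str, list[str]]  # (working directory, argv)


def publish_plan(repo: str, log_dir: str, tag: str) -> list[Step]:
    """Steps 3-4: track one day's logs, upload them, then commit and tag."""
    return [
        (repo, ["dvc", "add", log_dir]),   # writes <log_dir>.dvc and git-ignores the raw logs
        (repo, ["dvc", "push"]),           # uploads cached data to the default DVC remote
        (repo, ["git", "add", "--all"]),   # stages the .dvc file and .gitignore changes
        (repo, ["git", "commit", "-m", f"Add logs {tag}"]),
        (repo, ["git", "tag", tag]),
    ]


def consume_plan(repo_url: str, tag: str, dest: str) -> list[Step]:
    """Analyst side: clone, check out a tagged version, download the data."""
    return [
        (".", ["git", "clone", repo_url, dest]),
        (dest, ["git", "checkout", tag]),
        (dest, ["dvc", "pull"]),           # fetches from the remote, then checks out
    ]


def run_plan(plan: list[Step]) -> None:
    """Run each command in its working directory, stopping on the first failure."""
    for cwd, argv in plan:
        subprocess.run(argv, cwd=cwd, check=True)
```

Calling run_plan(publish_plan("data-repo", "2022-08-13", "2022-08-13")) would execute the owner's side; step 5 (deleting the local copies) is deliberately not automated here.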

Proposed Method

  1. Provide a single dvc command (something like dvc init --bare --remote, or a Python API) to create a “git tracked bare dvc repo” on a remote machine.
  2. Provide a single dvc command (something like dvc push --transfer --remote, or a Python API) to transfer the daily logs directly to the remote. This command has a --tag option and performs steps 2-5 above on the remote machine.
  3. When daily logs arrive, just run step 2 to transfer the files with a version tag (no need to copy them into a local git repo).
  4. When someone needs to analyse the log files, they can clone the “git tracked bare dvc repo”, which contains only .dvc files, git checkout a tagged version, and run dvc pull to download the files.
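
The commands proposed above do not exist in DVC. As a rough approximation with existing commands, and assuming a DVC version that supports dvc add --to-remote, the transfer-and-tag step might look like the sketch below. It still needs a local clone of the git repo that holds the .dvc files, so it is not equivalent to the proposal. The function name, the git remote name origin and the paths are hypothetical.

```python
Step = tuple[str, list[str]]  # (working directory, argv)


def transfer_plan(repo: str, source: str, out_name: str, tag: str) -> list[Step]:
    """Send logs straight to the DVC remote, then commit and tag the .dvc file."""
    return [
        # Transfers `source` to the default DVC remote without keeping a local
        # copy; only <out_name>.dvc is written into the repo.
        (repo, ["dvc", "add", source, "-o", out_name, "--to-remote"]),
        (repo, ["git", "add", "--all"]),
        (repo, ["git", "commit", "-m", f"Add logs {tag}"]),
        (repo, ["git", "tag", tag]),
        # Assumes the git remote is called "origin".
        (repo, ["git", "push", "origin", "HEAD", "--tags"]),
    ]
```

Each step can be run with subprocess.run(argv, cwd=cwd, check=True). Because the data never lands in the workspace or the local cache, step 5 of the current method (deleting local copies) is not needed in this variant.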

Further, because the “git tracked bare dvc repo” should only be modified by the data owner, other people cannot push their code to it. Instead, they create a new git repo and add the “git tracked bare dvc repo” as another git remote. In the git graph, they then see two parallel lines: one for our data repo and one for their code repo.

They can cherry-pick a commit from the data repo, move the .dvc file into another folder, and then run dvc pull. The files are pulled from our data storage and downloaded into their folder. They can then start writing their code and commit it to their own git remote.
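
A minimal sketch of the developer's side, assuming their code repo already has DVC initialised with a DVC remote that points at the same storage as the data repo, and that the destination folder exists. The remote name data, the function name and all paths are hypothetical, and the cherry-pick may need manual conflict resolution if both repos touch the same files.

```python
Step = tuple[str, list[str]]  # (working directory, argv)


def import_data_plan(code_repo: str, data_url: str, commit: str,
                     dvc_file: str, new_path: str) -> list[Step]:
    """Pull one data commit from the data repo into a separate code repo."""
    return [
        (code_repo, ["git", "remote", "add", "data", data_url]),
        (code_repo, ["git", "fetch", "data", "--tags"]),
        # Brings only the .dvc metadata change into the code history.
        (code_repo, ["git", "cherry-pick", commit]),
        # Output paths in a .dvc file are relative to it, so the data
        # will be materialised next to the file's new location.
        (code_repo, ["git", "mv", dvc_file, new_path]),
        (code_repo, ["dvc", "pull", new_path]),
    ]
```

Each step can be run with subprocess.run(argv, cwd=cwd, check=True). The developer never pushes to the data remote; they only fetch from it.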

Sum up

The “git tracked bare dvc repo” can be treated as a combination of a bare git repo and a DVC cache: a complete structure that exists only to track data blobs. It can be used as a regular git remote or imported as a git submodule, but it can only be modified by the data owner. Developers just include it, pull the data, run their experiments, and push to their own repo without touching the data repo.

Also, if you don’t use git, you can still treat it as a regular DVC cache remote. But with git, you have the full power of git!

Advanced

If you have multiple data sources and want them to share a single data repo, the command in proposed step 2 could accept a --source option and create a git branch named after the given source. This new branch is parallel to the other source branches (it shares no common commit with them). From the developer’s point of view, the data repo contains many parallel branches, and they just need to pick one branch (one data source) to merge into their local working branch.

If the data owner needs to merge two data sources into one, it can be as easy as running git merge in the data repo to combine the two parallel source branches into a single branch.
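
With plain git, branches that share no common commit can be created with git checkout --orphan, and merging them later requires --allow-unrelated-histories, because git refuses to merge unrelated histories by default. The sketch below only builds those two command sequences; the function names and branch names are hypothetical, and the --source option itself remains a proposal.

```python
Step = tuple[str, list[str]]  # (working directory, argv)


def new_source_branch_plan(repo: str, source: str) -> list[Step]:
    """Start a branch for a data source with no history shared with other branches."""
    return [
        # --orphan keeps the working tree (including .dvc/), but the first
        # commit on this branch will have no parent.
        (repo, ["git", "checkout", "--orphan", source]),
    ]


def merge_sources_plan(repo: str, target: str, other: str) -> list[Step]:
    """Merge one source branch into another despite their unrelated histories."""
    return [
        (repo, ["git", "checkout", target]),
        (repo, ["git", "merge", "--allow-unrelated-histories",
                "-m", f"Merge data source {other} into {target}", other]),
    ]
```

Each step can be run with subprocess.run(argv, cwd=cwd, check=True). If both branches added a file with the same path but different content (for example .gitignore), the merge will stop with a conflict that has to be resolved by hand.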

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

shcheklein commented, Aug 13, 2022

@allenyllee yep, and what I’m trying to understand is what exactly is missing and how it differs from the proposal I have in that thread. It would be really helpful if you could try it and let us know if something is missing.

allenyllee commented, Aug 13, 2022

@shcheklein Sorry, I haven’t tested it yet. But I saw this question in the DVC forum: https://discuss.dvc.org/t/large-data-registry-on-nas-with-multiple-dvc-and-non-dvc-users/1294

I think his problem is similar to ours, and I think what I proposed could solve his problem too.
