Git tracked bare DVC repo (track only the .dvc files, without checking out the real files)
Background
We generate a lot of log files every day, and we want to use DVC to track them.
Current Method
If we want to use DVC to track our daily logs, for now we have to:
1. Create a git repo and run dvc init.
2. Copy the log files into the git repo.
3. Run dvc add on those files and dvc commit to generate the .dvc file, then dvc push to transfer the files to the remote.
4. git commit the generated .dvc file, and git tag it to add a timestamp (or version).
5. To save local space, remove all the log files and leave only the .dvc files.
When new daily logs arrive, we need to repeat steps 2-5 to track them.
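The repeated steps 2-5 can be sketched as a small shell function. This is only an illustration under stated assumptions: the name track_daily_logs, its three arguments, and the logs/VERSION layout are hypothetical, and the repo is assumed to already have dvc init run and a default DVC remote configured. Since dvc add already writes the .dvc file, the sketch leaves out the separate dvc commit call.

```shell
# Hypothetical helper: track_daily_logs REPO_DIR LOG_SRC VERSION
# Assumes REPO_DIR is a git repo where `dvc init` has been run (step 1)
# and that a default DVC remote is configured.
track_daily_logs() (
    repo_dir="$1"; log_src="$2"; version="$3"
    dest="logs/$version"    # assumed layout: one directory per version

    # Step 2: copy the new log files into the git repo.
    mkdir -p "$repo_dir/$dest" || exit 1
    cp -r "$log_src"/. "$repo_dir/$dest"/ || exit 1
    cd "$repo_dir" || exit 1

    # Step 3: `dvc add` writes "$dest.dvc" and a .gitignore entry,
    # then `dvc push` uploads the cached data to the remote.
    dvc add "$dest" || exit 1
    dvc push || exit 1

    # Step 4: commit the .dvc file and tag the version.
    git add "$dest.dvc" logs/.gitignore || exit 1
    git commit -m "Add logs for $version" || exit 1
    git tag "$version" || exit 1

    # Step 5: remove the log files; only the .dvc file stays.
    rm -rf "$dest"
)
```

A call might look like track_daily_logs ~/log-repo /var/log/myapp 2024-01-01 (paths made up for illustration). Note that the local DVC cache under .dvc/cache may still hold a copy of the data after step 5, so it may also need clearing to actually free the space.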
When someone needs to analyse the log files, they need to clone the git repo, git checkout a tagged version, and run dvc checkout to download the files locally.
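A minimal sketch of the consumer side, assuming a hypothetical helper named fetch_logs and a placeholder repo URL. One deliberate difference from the description: on a fresh clone the local DVC cache is empty and dvc checkout only restores files from that cache, so the sketch uses dvc pull, which fetches from the remote and then checks the files out.

```shell
# Hypothetical helper: fetch_logs REPO_URL VERSION [WORKDIR]
fetch_logs() (
    repo_url="$1"; version="$2"; workdir="${3:-logs-repo}"

    # Clone the git repo; it contains only the small .dvc files.
    git clone "$repo_url" "$workdir" || exit 1
    cd "$workdir" || exit 1

    # Switch to the tagged version of the .dvc files.
    git checkout "$version" || exit 1

    # Download the matching data from the DVC remote into the workspace.
    dvc pull
)
```

To download a single version's data only, dvc pull also accepts a target such as a specific .dvc file.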
Proposed Method
1. Provide a single dvc command (something like dvc init --bare --remote, or a Python API) to create a “git tracked bare dvc repo” on a remote machine.
2. Provide a single dvc command (something like dvc push --transfer --remote, or a Python API) to transfer daily logs directly to the remote. This command has a --tag option and performs steps 2-5 above on the remote machine.
3. When daily logs arrive, just do step 2 to transfer the files with a version tag (no need to copy them into a local git repo).
4. When someone needs to analyse the log files, they can clone the “git tracked bare dvc repo”, which contains only .dvc files, git checkout a tagged version, and run dvc checkout to download the files.
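The proposed dvc push --transfer --tag command is a suggestion, not an existing DVC feature. Until something like it exists, its effect could be approximated with existing commands in a throwaway clone, roughly as below. The helper name transfer_logs, its arguments, and the logs/VERSION layout are all assumptions made for illustration.

```shell
# Hypothetical helper: transfer_logs REPO_URL LOG_SRC VERSION
# Emulates the proposed one-shot transfer with a temporary clone.
transfer_logs() (
    repo_url="$1"; log_src="$2"; version="$3"
    work=$(mktemp -d) || exit 1

    (
        # Clone only the small .dvc metadata, then add today's logs.
        git clone "$repo_url" "$work/repo" &&
        mkdir -p "$work/repo/logs/$version" &&
        cp -r "$log_src"/. "$work/repo/logs/$version"/ &&
        cd "$work/repo" &&
        # Upload the data, then record and tag the .dvc file.
        dvc add "logs/$version" &&
        dvc push &&
        git add "logs/$version.dvc" logs/.gitignore &&
        git commit -m "Add logs for $version" &&
        git tag "$version" &&
        # Publish the new commit and its tag to the data repo.
        git push origin HEAD "$version"
    )
    status=$?

    # Drop the temporary clone whether or not the transfer succeeded.
    rm -rf "$work"
    exit "$status"
)
```

Unlike the proposal, this still stages a temporary local copy of the logs, so it saves the manual steps but not the transient disk usage.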
Further, because the “git tracked bare dvc repo” should only be modified by the data owner, other users cannot push their code to its remote. Instead, they create a new git repo and add the “git tracked bare dvc repo” as another git remote. In the git graph, they will see two parallel lines: one for our data repo and one for their code repo.
They can cherry-pick a commit from the data repo, move the .dvc file into another folder, and then run dvc checkout. The files are pulled from our data repo and downloaded into their folder; then they can start writing their code and commit to their own git remote.
Summary
We can treat the “git tracked bare dvc repo” as a combination of a git bare repo and a dvc cache: a whole structure used only for tracking data blobs. It can be seen as a regular git remote or imported as a git submodule, but it can only be modified by the data owner. Developers just include it, pull the data, run their experiments, and push to their own repo without touching the data repo.
Also, if you don’t use git, you can still treat it as a regular dvc cache remote. But with git, you have the full power of git!
Advanced
If you have multiple data sources and want them to share a single data repo, the command in proposed step 2 could accept a --source option; the command would then create a git branch named after the given source. This newly created branch is parallel to the other source branches (with no common commits). From the developer’s view, many parallel branches reside in the data repo, and they just need to pick a branch (a data source) to merge into their local working branch.
If the data owner needs to merge two data sources into one, it can be as easy as running git merge in the data repo to merge the two parallel source branches into one branch!
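With plain git, parallel source branches and their later merge could look roughly like the sketch below. The helper names and branch names are hypothetical, and it assumes each source keeps its .dvc files under its own directory so that paths do not collide on merge. Because the branches share no common commit, git merge refuses to combine them unless --allow-unrelated-histories is passed.

```shell
# Hypothetical helpers, meant to be run inside the data repo.

# new_source_branch SOURCE: start a new, parallel history for one source.
new_source_branch() {
    git checkout --orphan "$1" || return 1
    # The orphan branch starts with the previous branch's files staged.
    # Remove them, but keep DVC's own config (.dvc/ and .dvcignore).
    git rm -rf -q --ignore-unmatch -- . ':(exclude).dvc' ':(exclude).dvcignore'
}

# merge_sources INTO FROM: merge source branch FROM into branch INTO.
merge_sources() {
    git checkout "$1" || return 1
    # Branches without a common commit need this flag (git 2.9 or newer).
    git merge --allow-unrelated-histories -m "Merge source $2 into $1" "$2"
}
```

For example, new_source_branch sensor-a followed by the usual add, commit and tag steps would start a history for a source called sensor-a, and merge_sources main sensor-a would later fold it into main.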
Issue Analytics
- Created a year ago
- Comments: 6 (4 by maintainers)
Top GitHub Comments
@allenyllee yep, and what I’m trying to understand is what exactly is missing and how it differs from the proposal I have in that thread. It would be really helpful if you could try it and let us know if something is missing.
@shcheklein Sorry, I haven’t tested it yet. But I saw this question in the DVC forum: https://discuss.dvc.org/t/large-data-registry-on-nas-with-multiple-dvc-and-non-dvc-users/1294
I think his problem is similar to ours, and I think what I proposed can solve his problem as well.