DVC integration
Hi,
Happened to watch your talk at the Databricks Data and AI summit 🎉. I enjoyed it a lot.
You mentioned considering integrations with version control systems, which made me wonder whether you might consider integrating MLtrace with DVC? DVC is a system for managing versioning of models, datasets and also the pipelines that handle building models from data. What they offer seems to be closely related to what MLtrace does with tracing the lineage of data as it passes through different processing stages.
A couple of thoughts I had:
- Include the DVC file hash / git tag associated with a file hash for a dataset/intermediate output in the info card of a component. We already have the git hash for the relevant git code commit, and including the file hash from DVC might provide data versioning without the user having to manage input/output versions through file naming. With the git tag, a user could backtrack to a previous project state by running `git checkout {git tag}` followed by `dvc checkout`, thereby allowing them to inspect what may have changed (model / data / code) between an earlier experiment and the current one.
- Incorporate information from `dvc status` into the staleness feature. The current staleness feature checks whether some time has passed since a component was run, or whether “dependencies have fresher runs that began before the component run started”. `dvc status` would also say whether the dataset / intermediate output has been modified between runs, and hence whether or not the pipeline should be re-run.
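To make the second idea a bit more concrete, here is a rough sketch of how a DVC check could be folded into a staleness decision. Everything here is an assumption for illustration: `is_stale`, `dvc_changed_stages`, `load_dvc_status` and the 30-day `max_age` are hypothetical names and values, not part of MLtrace, and the sketch assumes the installed DVC version supports `dvc status --json` and returns a mapping from stage / .dvc file names to lists of changes (empty when everything is up to date).

```python
import json
import subprocess
from datetime import datetime, timedelta


def dvc_changed_stages(status: dict) -> list:
    """Return the names of stages / .dvc files that have at least one reported change."""
    return sorted(name for name, changes in status.items() if changes)


def is_stale(last_run: datetime, now: datetime, status: dict,
             max_age: timedelta = timedelta(days=30)) -> bool:
    """Hypothetical staleness rule: the run is too old OR DVC reports changed data."""
    too_old = now - last_run > max_age
    return too_old or bool(dvc_changed_stages(status))


def load_dvc_status(repo_dir: str = ".") -> dict:
    """Run `dvc status --json` in repo_dir (assumes dvc is installed and on PATH)."""
    result = subprocess.run(
        ["dvc", "status", "--json"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    )
    # Treat empty output as "nothing changed".
    return json.loads(result.stdout) if result.stdout.strip() else {}
```

With something like this, the existing time-based check stays as it is and the DVC result is just one more reason to flag a component run, e.g. `is_stale(last_run, datetime.now(), load_dvc_status())`.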
Design considerations:
- DVC has quite a few features. If we have a few we’d like to integrate into MLtrace, it might make sense to write a companion library similar to what Dagster has done with Great Expectations (dagster-ge) or with Pandas (dagster-pandas).
However, if we want to start with only one or two points of integration (e.g. just the file hash), then we might consider adding a method to either the `ComponentRun` or `Component` class that grabs the necessary information via the repo’s .dvc file or the DVC Python API and displays that info in the UI / CLI.
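For the "just the file hash" starting point, a minimal sketch of reading the hash straight from a .dvc file might look like the following. It assumes the usual .dvc layout where each entry under `outs:` carries an `md5:` line (with a `.dir` suffix for tracked directories), and it uses a regex rather than a YAML parser to stay dependency-free. The function names `dvc_hashes_from_text` and `dvc_hashes` are hypothetical and not part of MLtrace or DVC.

```python
import re
from pathlib import Path

# Matches lines such as "- md5: <32 hex chars>" or "  md5: <32 hex chars>.dir".
_MD5_LINE = re.compile(
    r"^\s*(?:-\s*)?md5:\s*([0-9a-f]{32}(?:\.dir)?)\s*$",
    re.MULTILINE,
)


def dvc_hashes_from_text(dvc_text: str) -> list:
    """Return every md5 value found in the text of a .dvc file, in order."""
    return _MD5_LINE.findall(dvc_text)


def dvc_hashes(dvc_file) -> list:
    """Read a .dvc file from disk and return its md5 values."""
    return dvc_hashes_from_text(Path(dvc_file).read_text())
```

The returned hash could then be stored alongside the git hash on a component run. A real implementation would probably want a proper YAML parser or the DVC Python API instead of a regex.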
Issue Analytics
- Created 2 years ago
- Comments: 5 (3 by maintainers)
Top GitHub Comments
I like this solution. Modifying the `IOPointer` is tedious because currently the `IOPointer` table’s primary key is the filename.

Sounds good! I’m working on having the DVC hash as a new attribute in `ComponentRun`. Should have a PR in a day or two 👍