Hi,

Happened to watch your talk at the Databricks Data and AI summit 🎉 . I enjoyed it a lot.

You mentioned considering integrations with version control systems, which made me wonder whether you might consider integrating MLtrace with DVC? DVC is a system for managing versioning of models, datasets and also the pipelines that handle building models from data. What they offer seems to be closely related to what MLtrace does with tracing the lineage of data as it passes through different processing stages.

A couple of thoughts I had:

  1. Include the DVC file hash, or a git tag associated with that file hash, for a dataset or intermediate output in the info card of a component. We already have the git hash for the relevant code commit, and including the DVC file hash might provide data versioning without the user having to manage input/output versions through file naming. With the git tag, a user could backtrack to a previous project state by running git checkout {git tag} followed by dvc checkout, thereby allowing them to inspect what may have changed (model / data / code) between an earlier experiment and the current one.

  2. Incorporate information from dvc status into the staleness feature. The current staleness feature checks whether some time has passed since a component was run, or whether “dependencies have fresher runs that began before the component run started”. dvc status would also say whether the dataset / intermediate output has been modified between runs, and hence whether or not the pipeline should be re-run.
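
To make the second idea concrete, here is a minimal sketch of a content-based staleness check. It does not call DVC or MLtrace at all: it only compares a plain MD5 of a file against a hash recorded at the time of an earlier run, which is roughly the kind of signal dvc status would contribute. The names file_md5, is_data_stale and demo are hypothetical helpers made up for this illustration.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_data_stale(recorded_hash, path):
    """True if the file content no longer matches the hash recorded at run time."""
    return file_md5(path) != recorded_hash


def demo():
    """Record a hash, modify the file, and report staleness before and after."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "features.csv")
        with open(path, "w") as handle:
            handle.write("a,b\n1,2\n")
        recorded = file_md5(path)  # hash stored alongside the (hypothetical) run
        before = is_data_stale(recorded, path)
        with open(path, "a") as handle:
            handle.write("3,4\n")  # the dataset changes after the run
        after = is_data_stale(recorded, path)
    return before, after


if __name__ == "__main__":
    print(demo())
```

A real integration would presumably ask DVC for this information rather than re-hashing files, so treat the hashing here as a stand-in for whatever dvc status reports.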

Design considerations:

  1. DVC has quite a few features. If there are several we’d like to integrate into MLtrace, it might make sense to write a companion library similar to what Dagster has done with Great Expectations (dagster-ge) or with Pandas (dagster-pandas). However, if we want to start with only one or two points of integration (e.g. just the file hash), then we might consider adding a method to either the ComponentRun or Component classes that grabs the necessary information via the repo’s .dvc file or the DVC Python API and displays that info in the UI / CLI.
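
As a rough sketch of the lighter-weight option, the snippet below pulls the md5 entry out of the text of a .dvc file and stores it next to the git hash on a small record. Everything here is an assumption for illustration: RunVersionInfo is a hypothetical stand-in and not MLtrace’s actual ComponentRun class, read_dvc_hash is a made-up helper, and the sample file content is invented. A regular expression is used only to keep the example dependency-free; a real implementation would more likely use a YAML parser or the DVC Python API.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Invented sample of what a small .dvc file can look like (YAML with an "outs" list).
SAMPLE_DVC_FILE = """\
outs:
- md5: a304afb96060aad90176268345e10355
  size: 1024
  path: data.xml
"""


def read_dvc_hash(dvc_text: str) -> Optional[str]:
    """Return the first md5 value found in .dvc file text, or None if absent."""
    match = re.search(
        r"^\s*-?\s*md5:\s*([0-9a-f]{32}(?:\.dir)?)\s*$",
        dvc_text,
        re.MULTILINE,
    )
    return match.group(1) if match else None


@dataclass
class RunVersionInfo:
    """Hypothetical record of the versions tied to one component run."""

    component_name: str
    git_hash: str
    dvc_hash: Optional[str] = None
    git_tag: Optional[str] = None

    def summary(self) -> str:
        """One-line description suitable for a CLI or info card."""
        return (
            f"{self.component_name}: code={self.git_hash} "
            f"data={self.dvc_hash or 'untracked'} tag={self.git_tag or 'none'}"
        )


if __name__ == "__main__":
    info = RunVersionInfo(
        component_name="featuregen",
        git_hash="abc1234",
        dvc_hash=read_dvc_hash(SAMPLE_DVC_FILE),
    )
    print(info.summary())
```

Keeping the hash as an optional attribute means runs on data that is not tracked by DVC still produce a valid record.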

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
shreyashankar commented, Jun 9, 2021

“There is a quicker fix where we can include a git tag as a ComponentRun attribute”

I like this solution. Modifying the IOPointer is tedious because currently the IOPointer table’s primary key is the filename.

1 reaction
jeannefukumaru commented, Jun 8, 2021

Sounds good! I’m working on adding the DVC hash as a new attribute in ComponentRun. Should have a PR in a day or two 👍
