Running DVC in production
I see a lot of value in using DVC during the development phase of a DS project, especially the ability to reproduce outputs only when dependencies have changed.
One of the problems we are trying to solve is how to move a data scientist’s code back and forth between development and production. Ideally we would want their local development experience to translate easily into production. I’ve created a toy project with DVC to see if we could use it for developing a multi-step pipeline that transfers data between steps.
However, there is one thing that is unclear when scheduling this same pipeline in Kubeflow/Airflow. Let’s assume that my pipeline is as follows:
- Get Data
- Transform Data
- Train Model
- Evaluate Model
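With DVC this might look roughly like the following (old-style per-stage `.dvc` files; all script names, paths, and stage-file names are hypothetical placeholders, and the `parameters` dependency on the first stage is explained just below):

```bash
# Rough sketch of the four stages as old-style dvc run calls.
dvc run -f get_data.dvc  -d get_data.py  -d parameters \
        -o data/raw.csv python get_data.py
dvc run -f transform.dvc -d transform.py -d data/raw.csv \
        -o data/features.csv python transform.py
dvc run -f train.dvc     -d train.py     -d data/features.csv \
        -o model.pkl python train.py
dvc run -f evaluate.dvc  -d evaluate.py  -d model.pkl \
        -M metrics.json python evaluate.py
```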
If I do all of my local development (`dvc run`, `dvc repro`) then everything works. But in a production setting I will have unique inputs to my pipeline. For example, the datetime stamp or other input variables will change. I can integrate this with DVC by having a file called `parameters` as a dependency of the Get Data step.
So when I run the pipeline on Airflow on different days, the dependencies for step 1 will be different, which means that step will get recomputed.
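In a scheduled run, each invocation would then only need to refresh that file before reproducing. A minimal sketch, assuming the stage files from the example above:

```bash
# Each scheduled run writes its own inputs, then lets DVC decide what to redo.
echo "run_date=$(date +%F)" > parameters   # different on every run, so Get Data re-runs
dvc repro evaluate.dvc                     # later stages re-run only if their own
                                           # dependencies actually changed
```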
The problem I have is that all of the steps in the graph have their hashes hardcoded based on the local development environment. So even if I rerun this whole pipeline multiple times with the same input parameters, none of the `*.dvc` files in the pipeline will be updated, meaning everything will rerun from scratch. That’s because they are running in an isolated production environment and not committing code back to the project repo. So DVC loses its value when wrapped in a scheduler.
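For reference, an old-style stage file records those hashes inline, roughly like this (paths and hash values are illustrative):

```yaml
cmd: python transform.py
deps:
- md5: 1f3870be274f6c49b3e31a0c6728957f
  path: data/raw.csv
outs:
- cache: true
  md5: 9e107d9d372bb6826bd81d3542a419d6
  path: data/features.csv
md5: a3cca2b2aa1e3b5b3b5aad99a8529074
```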
Am I missing something, or is DVC primarily useful in local development only?
Correct. In my case we are running each step in a pipeline as a separate container (Kubeflow Pipelines, or Airflow with Kubernetes). What this means is that I need to somehow get the DVC files into the next container so that those previous steps don’t rerun. One way is to do git commits, another is having a data management layer that does this between steps.
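For the git-commit option, each containerized task could, for example, clone the repo, reproduce its one stage, push the outputs to the DVC remote, and commit the updated stage file so the next container starts from it. A sketch with hypothetical repo, branch, and stage names:

```bash
# Inside the container for the Transform step. Stage files committed by
# earlier tasks arrive with the clone, so those stages are not recomputed.
git clone --branch pipeline-state git@example.com:org/project.git
cd project
dvc pull                       # fetch cached outputs of earlier stages
dvc repro transform.dvc        # re-runs only if its dependencies changed
dvc push                       # upload this stage's outputs to the DVC remote
git add transform.dvc
git commit -m "transform stage for run $(date +%F)" || true   # no-op if nothing changed
git push origin pipeline-state
```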
I solved this in my project by creating a small script to sync DVC stages between prod and dev. It’s something like this:
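The script itself isn’t included here; a rough sketch of the idea, assuming one old-style `.dvc` stage file per step and hypothetical paths:

```bash
#!/usr/bin/env bash
# Rough sketch: bring stage definitions from the dev checkout into the
# production workspace, but keep production's recorded hashes whenever the
# stage definition itself hasn't changed. DEV_REPO and PROD_WS are placeholders.
DEV_REPO="dev-checkout"
PROD_WS="."
shopt -s nullglob

for dev_stage in "$DEV_REPO"/*.dvc; do
    name="$(basename "$dev_stage")"
    prod_stage="$PROD_WS/$name"
    # Compare stage files with the md5 lines stripped: if only hashes differ,
    # the definition is unchanged and the production copy (and its hashes) is kept.
    if [[ ! -f "$prod_stage" ]] || \
       ! diff -q <(grep -v 'md5:' "$dev_stage") <(grep -v 'md5:' "$prod_stage") >/dev/null; then
        cp "$dev_stage" "$prod_stage"   # new or changed stage: take the dev version
    fi
done
```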
It copies the stage config, but keeps the asset hashes unchanged where possible. I agree it would be great if DVC supported this out of the box.