repro: use build cache for deterministic stages
See original GitHub issue

In my experiment I run a few different preprocessing steps, each of which creates a different CSV file; then I model this data, also checking different parameters.
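A setup like the one described might look roughly like this in dvc.yaml (stage names, paths, and parameters here are illustrative assumptions, not taken from the issue):

```yaml
stages:
  preprocess:
    cmd: python preprocess.py --method ${method} -o data/clean.csv
    deps:
      - preprocess.py
      - data/raw.csv
    params:
      - method
    outs:
      - data/clean.csv
  train:
    cmd: python train.py data/clean.csv --alpha ${alpha}
    deps:
      - train.py
      - data/clean.csv
    params:
      - alpha
    outs:
      - model.pkl
```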
When I want to run the same experiment on 4 different machines (DVC is connected to the same remote cache), every type of preprocessing is run on every machine, which takes a lot of time and could be avoided by running dvc pull before dvc repro and dvc push after it.
It could work as one command, something like dvc repro --remote.
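The manual workaround could be sketched as a small wrapper script; note that --remote is a proposed flag here, not an existing dvc repro option, and the script assumes a project with a remote cache configured:

```shell
# Hypothetical wrapper approximating the proposed "dvc repro --remote".
# Assumes a DVC project with a remote cache configured; does nothing
# when dvc is not installed.
if command -v dvc >/dev/null 2>&1; then
    dvc pull      # fetch outputs that other machines already pushed
    dvc repro     # rerun only stages whose outputs are missing or stale
    dvc push      # share newly computed outputs via the remote cache
fi
```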
Issue Analytics
- State:
- Created 5 years ago
- Reactions: 3
- Comments: 23 (15 by maintainers)
Yes, I will definitely give it a try)
Actually, all three scenarios are the same: we run the same experiment several times and do not want to recalculate outputs if they are already available in the local/remote cache.
The second option is more about checking data in the remote cache; this is going to be helpful during the experimentation phase. The third option is related to retraining of models: e.g. I run the same pipeline on different input data each month. I do it sequentially, in a Docker container containing the DVC pipeline and the required source code. Some stage inputs in the middle of the pipeline may happen to be the same across different pipeline input data. A local build-cache would be helpful in that situation.
BTW, I think it is not so difficult to implement. We could create a build-cache folder inside the .dvc folder and store full .dvc files (or just the part related to outputs) under a hash of their inputs. It would then be possible to merge branches painlessly, provided outputs are written in a deterministic order. Merge conflicts would signal that something is wrong with the experiment setup on one of the machines.
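The proposed build-cache could be sketched as follows; this is a minimal illustration of the idea, not DVC's actual implementation, and all function and path names are hypothetical:

```python
# Sketch of the proposed .dvc/build-cache: a stage's outputs section is
# stored in a file named after a hash of the stage's inputs, so a rerun
# with identical inputs can restore outputs instead of recomputing them.
import hashlib
import json
from pathlib import Path

def input_hash(dep_checksums: dict) -> str:
    """Hash of a stage's inputs: dependency checksums in a fixed order."""
    payload = json.dumps(sorted(dep_checksums.items())).encode()
    return hashlib.md5(payload).hexdigest()

def save_stage(cache_dir: Path, dep_checksums: dict, outs: list) -> Path:
    """Record a stage's outputs under the hash of its inputs."""
    entry = cache_dir / input_hash(dep_checksums)
    # Write outputs in a deterministic order so branch merges of the
    # build-cache stay conflict-free, as suggested above.
    entry.write_text(json.dumps({"outs": sorted(outs, key=lambda o: o["path"])}))
    return entry

def lookup_stage(cache_dir: Path, dep_checksums: dict):
    """Return cached outputs for these inputs, or None on a cache miss."""
    entry = cache_dir / input_hash(dep_checksums)
    if entry.exists():
        return json.loads(entry.read_text())["outs"]
    return None
```

On a cache hit the runner would skip the stage and check out the recorded outputs from the (local or remote) object cache; on a miss it would run the stage and record the result.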