repro: use build cache for deterministic stages

In my experiment I run several preprocessing steps, each of which creates a different CSV file; I then model this data and check different parameters. I want to run the same experiment on 4 different machines (DVC is connected to the same remote cache on each of them). Currently, every preprocessing step is rerun on every machine, which takes a lot of time and could be avoided by running dvc pull before dvc repro and dvc push after it. This could work as a single command, such as dvc repro --remote.
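The manual workaround described above (pull, then repro, then push) can be sketched as a small Python wrapper. This is only an illustration: the function name `repro_with_shared_cache` and the injectable `run` parameter are hypothetical and not part of DVC; the only DVC commands used are `dvc pull`, `dvc repro` and `dvc push`. The sketch assumes that a failed pull should not block the run, since the remote may not hold the outputs yet.

```python
import subprocess
from typing import Callable, Sequence


def repro_with_shared_cache(
    run: Callable[[Sequence[str]], int] = subprocess.call,
) -> int:
    """Pull cached outputs, reproduce the pipeline, then push new outputs.

    `run` is a hypothetical hook that executes a command and returns its
    exit code; it defaults to subprocess.call and can be swapped in tests.
    """
    # Best effort (assumption): on a first run the remote may not contain
    # the outputs yet, so the pull exit code is deliberately ignored.
    run(["dvc", "pull"])

    # Stages whose outputs are already present locally should not need rerunning.
    code = run(["dvc", "repro"])

    # Only share results with the other machines if repro succeeded.
    if code == 0:
        code = run(["dvc", "push"])
    return code
```

Calling `repro_with_shared_cache()` on each machine would then replace the three manual commands; whether ignoring a failed pull is the right policy depends on the setup.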

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Reactions: 3
  • Comments: 23 (15 by maintainers)

Top GitHub Comments

2 reactions
vasinkd commented, Feb 4, 2020

Yes, I will definitely give it a try :)

2 reactions
vasinkd commented, Feb 3, 2020

Actually, all three scenarios are the same: we run the same experiment several times and do not want to recalculate outputs if they are already available in the local or remote cache.

The second option is more about checking data in the remote cache, which would be helpful during the experimentation phase. The third option relates to retraining models: e.g., I run the same pipeline on different input data each month. I do it sequentially, in a Docker container that holds the DVC pipeline and the required source code. The inputs of some stages in the middle of the pipeline may turn out to be the same for different pipeline input data. A local build cache might be helpful in that situation.

BTW, I think it is not so difficult to implement. We could create a build-cache folder inside the .dvc folder and store full .dvc files (or just the part related to outputs) under a hash of the inputs. It would then be possible to merge branches painlessly, provided the outputs are written in a deterministic order. Merge conflicts would signal that something is wrong with the experiment setup on one of the machines.
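A minimal sketch of the idea in this comment, assuming a stage is identified by its command plus the content hashes of its dependencies. It is not DVC's implementation: the function names, the JSON entry format and the choice of SHA-256 are all assumptions made for illustration. Each entry is a file named after the hash of the inputs and holds the output hashes with sorted keys, so the same stage produces byte-identical entries on every machine.

```python
import hashlib
import json
from pathlib import Path
from typing import Dict, Optional


def inputs_key(cmd: str, dep_hashes: Dict[str, str]) -> str:
    """Hash the stage command and its dependency hashes into a cache key."""
    # sort_keys makes the key independent of dictionary insertion order.
    payload = json.dumps({"cmd": cmd, "deps": dep_hashes}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def save_outputs(cache_dir: Path, key: str, out_hashes: Dict[str, str]) -> Path:
    """Record the output hashes of a finished stage under its inputs key."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    entry = cache_dir / key
    # Deterministic ordering keeps entries identical across machines, so
    # merging branches only conflicts when the outputs genuinely differ.
    entry.write_text(json.dumps(out_hashes, sort_keys=True, indent=2) + "\n")
    return entry


def lookup_outputs(cache_dir: Path, key: str) -> Optional[Dict[str, str]]:
    """Return the recorded output hashes for this key, or None on a miss."""
    entry = cache_dir / key
    if not entry.exists():
        return None
    return json.loads(entry.read_text())
```

On a cache hit, the stage could be skipped and its outputs checked out from the local or remote cache by hash; on a miss, the stage runs and `save_outputs` records the result. A `cache_dir` such as `.dvc/build-cache` matches the folder proposed in the comment.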
