repro: use build cache for deterministic stages
See original GitHub issue

In my experiment I run a few different preprocessing steps, each of which creates a different CSV file; then I model this data, also checking different parameters.
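A setup like the one described might look roughly like this in dvc.yaml (stage names, paths, and parameters here are illustrative assumptions, not taken from the issue):

```yaml
stages:
  preprocess:
    cmd: python preprocess.py --method ${method} -o data/clean.csv
    deps:
      - preprocess.py
      - data/raw.csv
    params:
      - method
    outs:
      - data/clean.csv
  train:
    cmd: python train.py data/clean.csv --alpha ${alpha}
    deps:
      - train.py
      - data/clean.csv
    params:
      - alpha
    outs:
      - model.pkl
```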
When I want to run the same experiment on 4 different machines (DVC is connected to the same remote cache), every type of preprocessing is run on every machine, which takes a lot of time and could be avoided by running dvc pull before dvc repro and dvc push after it.
It could work as one command, something like dvc repro --remote.
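The manual workaround could be sketched as a small wrapper script; note that --remote is a proposed flag here, not an existing dvc repro option, and the script assumes a project with a remote cache configured:

```shell
# Hypothetical wrapper approximating the proposed "dvc repro --remote".
# Assumes a DVC project with a remote cache configured; does nothing
# when dvc is not installed.
if command -v dvc >/dev/null 2>&1; then
    dvc pull      # fetch outputs that other machines already pushed
    dvc repro     # rerun only stages whose outputs are missing or stale
    dvc push      # share newly computed outputs via the remote cache
fi
```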
Issue Analytics
- State:
- Created 5 years ago
- Reactions: 3
- Comments: 23 (15 by maintainers)
Yes, I will definitely give it a try)
Actually, all three scenarios are the same: we run the same experiment several times and do not want to recalculate outputs if they are already available in the local/remote cache.
The second option is more about checking data in the remote cache; this is going to be helpful during the experimentation phase. The third option is related to retraining of models: e.g. I run the same pipeline on different input data each month. I do it sequentially, in a Docker container containing the DVC pipeline and the required source code. Some stage inputs in the middle of the pipeline may happen to be the same across different pipeline input data. A local build-cache would be helpful in that situation.
BTW, I think it is not so difficult to implement. We could create a build-cache folder inside the .dvc folder and store full .dvc files (or just the part related to outputs) under a hash of their inputs. It would then be possible to merge branches painlessly, provided outputs are written in a deterministic order. Merge conflicts would signal that something is wrong with the experiment setup on one of the machines.
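The proposed build-cache could be sketched as follows; this is a minimal illustration of the idea, not DVC's actual implementation, and all function and path names are hypothetical:

```python
# Sketch of the proposed .dvc/build-cache: a stage's outputs section is
# stored in a file named after a hash of the stage's inputs, so a rerun
# with identical inputs can restore outputs instead of recomputing them.
import hashlib
import json
from pathlib import Path

def input_hash(dep_checksums: dict) -> str:
    """Hash of a stage's inputs: dependency checksums in a fixed order."""
    payload = json.dumps(sorted(dep_checksums.items())).encode()
    return hashlib.md5(payload).hexdigest()

def save_stage(cache_dir: Path, dep_checksums: dict, outs: list) -> Path:
    """Record a stage's outputs under the hash of its inputs."""
    entry = cache_dir / input_hash(dep_checksums)
    # Write outputs in a deterministic order so branch merges of the
    # build-cache stay conflict-free, as suggested above.
    entry.write_text(json.dumps({"outs": sorted(outs, key=lambda o: o["path"])}))
    return entry

def lookup_stage(cache_dir: Path, dep_checksums: dict):
    """Return cached outputs for these inputs, or None on a cache miss."""
    entry = cache_dir / input_hash(dep_checksums)
    if entry.exists():
        return json.loads(entry.read_text())["outs"]
    return None
```

On a cache hit the runner would skip the stage and check out the recorded outputs from the (local or remote) object cache; on a miss it would run the stage and record the result.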