Preserve timestamps during caching
Background
DVC pipelines make decisions about whether to execute stages based on the content (checksum) of the dependencies. This is awesome and it is one of the reasons why we are planning to use DVC for top-level pipeline orchestration.
Unfortunately, DVC pipelines lack features from other workflow managers, such as parallelization and environment switching. This is both a blessing and a curse – a blessing because it means that DVC pipelines are simple and easy to learn, but a curse because features such as parallelization are central to our existing workflows.
So we are working on using DVC pipelines to coordinate Snakemake workflows. DVC takes care of data integrity, while Snakemake iterates over samples and orchestrates parallel processing, etc.
So far this is going well, at least at the DVC level. But Snakemake makes its decisions about what to execute based on timestamps.
However, when a file is added to a DVC project via `dvc add` or `dvc repro`, both the symlink AND the cached data have a new timestamp corresponding to the time of the DVC operation.
As a result, if we tinker with the content of a stage (a Snakemake workflow) we have to re-run the entire stage (workflow) and not just the new bits, unless we fuss around `touch`ing timestamps. This is tedious and error prone, and “rewrites history” by assigning false timestamps.
(Of course, if neither the workflow (stage) nor its dependencies have changed, then the entire workflow (stage) is skipped, which is great.)
We prefer the checksum-based execution decisions as in DVC, but we would like to make this compatible with the timestamp-based decisions in Snakemake workflows.
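For concreteness, the `touch` workaround mentioned above boils down to copying timestamps from a reference file onto a target, i.e. the Python equivalent of `touch -r`. The helper below is purely illustrative, not part of DVC:

```python
import os

def copy_mtime(reference: str, target: str) -> None:
    """Rough equivalent of `touch -r reference target`: give `target`
    the access and modification times of `reference`."""
    st = os.stat(reference)
    os.utime(target, ns=(st.st_atime_ns, st.st_mtime_ns))
```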
Feature request:
Add an option to `dvc add` and `dvc repro` to preserve timestamps.
Specifically, when this option is specified, then for each file or directory added to a DVC project, both the symlink in the workspace and the actual data in the cache should have a timestamp matching that of the original data that was added. If identical data is added later (identical in content, that is), then the timestamps can be updated to match those of the later file.
In addition, add an option to `dvc checkout` so that the timestamps of the symlinks created in the workspace match those of the target data in the cache.
Together, these two changes should allow DVC and Snakemake to play nicely together 😃
Who knows, it might even make sense to make these the default options …
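To make the request concrete, here is a minimal sketch of the snapshot-and-restore behaviour described above. The helper names and the `cache_path_for` mapping are illustrative assumptions, not part of the DVC API:

```python
import os

def snapshot_timestamps(paths):
    """Before caching: record (atime_ns, mtime_ns) for every path."""
    return {p: (os.stat(p).st_atime_ns, os.stat(p).st_mtime_ns) for p in paths}

def restore_timestamps(snapshot, cache_path_for):
    """After caching: re-apply the recorded timestamps to both the
    workspace entry and its cached copy."""
    for path, times in snapshot.items():
        # Stamp the workspace entry itself; follow_symlinks=False keeps a
        # symlink's own timestamp instead of touching its target (this
        # flag is not supported on every platform).
        os.utime(path, ns=times, follow_symlinks=False)
        os.utime(cache_path_for(path), ns=times)  # hypothetical cache lookup
```

Note that with symlink or hardlink cache types the workspace entry and the cache share an inode, so two workspace copies of the same cached file cannot carry different timestamps, which is exactly the conflict raised in the comments below.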
Top GitHub Comments
Thanks @efiop! I’m very grateful for the consideration you’re giving this.
Re file system latency, I think that if the timestamp gap is small (in the order of a second or two) then the computational cost of re-running that processing will also be small, so I think we can live with it. Also, to clarify our intended use case: in situations where we use a Snakemake workflow to implement a DVC stage, I expect that the entire workflow will run to completion before the `outs` specified in the DVC stage are cached. That is, DVC caching (and any associated timestamp hacking) will happen after Snakemake has finished running its workflow, and so Snakemake will no longer be relying on these timestamps. As a result, timestamp manipulations during caching are unlikely to affect Snakemake during that particular execution. Hopefully it will be possible to snapshot the timestamp of each file and directory prior (*) to caching, and then apply this timestamp to both the link and the cache after (*) caching.

If this were possible, the advantage for us would only be apparent on subsequent reproductions (**). Specifically, if we modify one of the rules in the Snakemake workflow then the workflow will need to run again, since it is itself one of the `deps` of the parent DVC stage. However, most of the rules within the workflow will probably still be the same, and their intermediate files may not need to be regenerated. If timestamps can be preserved, Snakemake will be able to decide intelligently what needs to be re-run, but currently timestamp rewriting forces the entire workflow to be re-executed, which can sometimes take a couple of days even on high performance hardware with extensive parallelisation.

You make a good point about potential conflicts arising from inconsistent timestamps amongst multiple links to the same cached file. This makes me think that perhaps timestamp preservation should be an “all or nothing” option, specified in the config rather than via options to `add`, `repro`, etc. At the risk of introducing further complications, timestamp preservation may need to be extended to remotes as well, in order to ensure consistency between instances on different platforms.

(*) Timing may be critical here in at least two aspects: 1) handover between DVC stages, 2) initiation of subsequent DVC stages that may themselves also be Snakemake workflows, and which include `deps` generated by an earlier stage. I think everything should be OK so long as the initiation of both 1) and 2) takes place after timestamp restoration, rather than immediately after the earlier stage finishes executing its `cmd`.

(**) “Subsequent reproductions” includes `repro`ductions in other instances (clones) of the DVC project. A colleague may wish to check out a project with the express purpose of tweaking one of the workflow-stages, perhaps something as simple as tweaking the formatting of the summary report for that workflow-stage. Ideally they should be able to `repro`duce the pipeline – including re-running the tweaked stage, but in such a way that only the bare minimum is actually re-executed (regenerating the report in this example).

There is vigorous debate within our group as to whether we should use Snakemake to coordinate multiple workflow modules (while asking Snakemake to `dvc add` the results as we go) or whether we should use DVC to coordinate multiple workflows (including, occasionally, Nextflow etc). I am strongly advocating for the latter, because I believe that checksum decisions are superior to timestamp decisions, and because `dvc.lock` ties everything together so beautifully, but timestamp rewriting is proving challenging. I appreciate that DVC has its own ambitions to become a fully mature pipeline manager, but I would like to draw your attention to the fact that most mature workflow managers include “handover” features for integration with other workflow managers. In order to fit into this ecosystem DVC may need to preserve timestamps, or at least offer an option to do so.

@johnyaku Saving timestamps as metadata in dvcfiles is indeed reasonable and would be a generally useful thing to have. Due to some other limitations, right now this can only be implemented for standalone files but not for files inside of dvc-tracked directories (the legacy `.dir` object format doesn't support that, and we have newer mechanisms that are not yet enabled by default).

Regarding dvc setting the mtime back: this can be done, but it is more involved and conflicts with symlinks and hardlinks, since they share the same inode with the cache, and that inode can be used in multiple places with different desired timestamps (though this should be doable with copies and maybe reflinks). There are also limitations like different mtime resolution on different filesystems (e.g. APFS is notorious for having a 1 sec resolution). Overall, with many caveats, this can be done (somewhat related to how we handle isexec), but it requires working towards a specific scenario (e.g. snakemake, which we are not using). I’m not sure though that all the caveats will make it worthwhile to be accepted in upstream dvc, especially with us having our own pipeline management.