Preserve timestamps during caching
Background
DVC pipelines make decisions about whether to execute stages based on the content (checksum) of the dependencies. This is awesome and it is one of the reasons why we are planning to use DVC for top-level pipeline orchestration.
Unfortunately, DVC pipelines lack features from other workflow managers, such as parallelization and environment switching. This is both a blessing and a curse – a blessing because it means that DVC pipelines are simple and easy to learn, but a curse because features such as parallelization are central to our existing workflows.
So we are working on using DVC pipelines to coordinate Snakemake workflows. DVC takes care of data integrity, while Snakemake iterates over samples and orchestrates parallel processing, etc.
So far this is going well, at least at the DVC level. But Snakemake makes its decisions about what to execute based on timestamps.
However, when a file is added to a DVC project via `dvc add` or `dvc repro`, both the symlink AND the cached data have a new timestamp corresponding to the time of the DVC operation.
As a result, if we tinker with the content of a stage (a Snakemake workflow) we have to re-run the entire stage (workflow) and not just the new bits, unless we fuss around `touch`ing timestamps. This is tedious and error prone, and “rewrites history” by assigning false timestamps.
(Of course, if neither the workflow (stage) nor its dependencies have changed, then the entire workflow (stage) is skipped, which is great.)
We prefer the checksum-based execution decisions as in DVC, but we would like to make this compatible with the timestamp-based decisions in Snakemake workflows.
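For concreteness, the `touch` workaround mentioned above boils down to copying timestamps from a reference file onto a target, i.e. the Python equivalent of `touch -r`. The helper below is purely illustrative, not part of DVC:

```python
import os

def copy_mtime(reference: str, target: str) -> None:
    """Rough equivalent of `touch -r reference target`: give `target`
    the access and modification times of `reference`."""
    st = os.stat(reference)
    os.utime(target, ns=(st.st_atime_ns, st.st_mtime_ns))
```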
Feature request:
Add an option to `dvc add` and `dvc repro` to preserve timestamps.
Specifically, when this option is specified, then for each file or directory added to a DVC project, both the symlink in the workspace and the actual data in the cache should have a timestamp matching that of the original data that was added. If identical data is added later (identical in content, that is), then the timestamps can be updated to match those of the later file.
In addition, add an option to `dvc checkout` so that the timestamps of the symlinks created in the workspace match those of the target data in the cache.
Together, these two changes should allow DVC and Snakemake to play nicely together 😃
Who knows, it might even make sense to make these the default options …
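To make the request concrete, here is a minimal sketch of the snapshot-and-restore behaviour described above. The helper names and the `cache_path_for` mapping are illustrative assumptions, not part of the DVC API:

```python
import os

def snapshot_timestamps(paths):
    """Before caching: record (atime_ns, mtime_ns) for every path."""
    return {p: (os.stat(p).st_atime_ns, os.stat(p).st_mtime_ns) for p in paths}

def restore_timestamps(snapshot, cache_path_for):
    """After caching: re-apply the recorded timestamps to both the
    workspace entry and its cached copy."""
    for path, times in snapshot.items():
        # Stamp the workspace entry itself; follow_symlinks=False keeps a
        # symlink's own timestamp instead of touching its target (this
        # flag is not supported on every platform).
        os.utime(path, ns=times, follow_symlinks=False)
        os.utime(cache_path_for(path), ns=times)  # hypothetical cache lookup
```

Note that with symlink or hardlink cache types the workspace entry and the cache share an inode, so two workspace copies of the same cached file cannot carry different timestamps, which is exactly the conflict raised in the comments below.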
Top GitHub Comments
Thanks @efiop! I’m very grateful for the consideration you’re giving this.
Re file system latency, I think that if the timestamp gap is small (in the order of a second or two) then the computational cost of re-running that processing will also be small, so I think we can live with it. Also, to clarify our intended use case: in situations where we use a Snakemake workflow to implement a DVC stage, I expect that the entire workflow will run to completion before the `outs` specified in the DVC stage are cached. That is, DVC caching (and any associated timestamp hacking) will happen after Snakemake has finished running its workflow, and so Snakemake will no longer be relying on these timestamps. As a result, timestamp manipulations during caching are unlikely to affect Snakemake during that particular execution. Hopefully it will be possible to snapshot the timestamp of each file and directory prior (*) to caching, and then apply this timestamp to both the link and the cache after (*) caching.

If this were possible, the advantage for us would only be apparent on subsequent reproductions (**). Specifically, if we modify one of the rules in the Snakemake workflow then the workflow will need to run again, since it is itself one of the `deps` of the parent DVC stage. However, most of the rules within the workflow will probably still be the same, and their intermediate files may not need to be regenerated. If timestamps can be preserved, Snakemake will be able to decide intelligently what needs to be re-run, but currently timestamp rewriting forces the entire workflow to be re-executed, which can sometimes take a couple of days even on high performance hardware with extensive parallelisation.

You make a good point about potential conflicts arising from inconsistent timestamps amongst multiple links to the same cached file. This makes me think that perhaps timestamp preservation should be an “all or nothing” option, specified in the config rather than via options to `add`, `repro`, etc. At the risk of introducing further complications, timestamp preservation may need to be extended to remotes as well, in order to ensure consistency between instances on different platforms.

(*) Timing may be critical here in at least two aspects: 1) handover between DVC stages, 2) initiation of subsequent DVC stages that may themselves also be Snakemake workflows, and which include `deps` generated by an earlier stage. I think everything should be OK so long as the initiation of both 1) and 2) takes place after timestamp restoration, rather than immediately after the earlier stage finishes executing its `cmd`.

(**) “Subsequent reproductions” includes `repro`ductions in other instances (clones) of the DVC project. A colleague may wish to check out a project with the express purpose of tweaking one of the workflow-stages, perhaps something as simple as tweaking the formatting of the summary report for that workflow-stage. Ideally they should be able to `repro`duce the pipeline – including re-running the tweaked stage, but in such a way that only the bare minimum is actually re-executed (regenerating the report in this example).

There is vigorous debate within our group as to whether we should use Snakemake to coordinate multiple workflow modules (while asking Snakemake to `dvc add` the results as we go) or whether we should use DVC to coordinate multiple workflows (including, occasionally, Nextflow etc). I am strongly advocating for the latter, because I believe that checksum decisions are superior to timestamp decisions, and because `dvc.lock` ties everything together so beautifully, but timestamp rewriting is proving challenging. I appreciate that DVC has its own ambitions to become a fully mature pipeline manager, but I would like to draw your attention to the fact that most mature workflow managers include “handover” features for integration with other workflow managers. In order to fit into this ecosystem DVC may need to preserve timestamps, or at least offer an option to do so.

@johnyaku Saving timestamps as metadata in dvcfiles is indeed reasonable and would be a generally useful thing to have. Due to some other limitations, right now this can only be implemented for standalone files but not for files inside of dvc-tracked directories (the legacy `.dir` object format doesn't support that, and we have newer mechanisms that are not yet enabled by default).

Regarding dvc setting the mtime back: this can be done, but it is more involved and conflicts with symlinks and hardlinks, since they share the same inode with the cache, and that inode can be used in multiple places with different desired timestamps (though this should be doable with copies and maybe reflinks). There are also limitations like different mtime resolution on different filesystems (e.g. APFS is notorious for having a 1 sec resolution). Overall, with many caveats, this can be done (somewhat related to how we handle isexec), but it requires working towards a specific scenario (e.g. snakemake, which we are not using). I’m not sure though that all the caveats will make it worthwhile to be accepted in upstream dvc, especially with us having our own pipeline management.