Filesystem persistence and using dask as a make/snakemake alternative
I brought up this issue/feature request in an issue at the cachey repo, and @mrocklin suggested that I open a new issue here.
It would be useful to add a persistent file-system caching mechanism to dask, to complement its in-memory caching ability. While the main advantage of in-memory caching is to speed up computations by saving redundant outputs, file-system based caching is more about maintaining persistence across sessions and potentially saving interesting intermediate outputs. This is particularly important when developing computational pipelines, because new code frequently has bugs that cause crashes. It is also useful for maintaining persistence between IPython or Jupyter notebook sessions. For example, suppose I am developing a workflow in a typical exploratory fashion, where `A` is some dask object:
```python
B = correct_but_expensive_function(A)
C = sketchy_under_development_function(B)
C.to_hdf5("done.h5")
```
The `correct_but_expensive_function` could successfully finish after 5 minutes of work, but the `sketchy_under_development_function` might use up all the memory on my Linux desktop and force me to kill the process, if I am lucky, or hard-reset the computer, if I am unlucky.
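A minimal sketch of the kind of checkpointing that would help here, using a hypothetical pickle-based helper (the name `checkpoint` and the on-disk format are my own assumptions, not an existing dask API):

```python
import os
import pickle

def checkpoint(path, compute):
    """Load a previously saved result from `path` if it exists;
    otherwise call `compute()` and persist the result to disk."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result

# B survives a crash in the sketchy step: rerunning the script
# reloads it from disk instead of recomputing for 5 minutes.
# B = checkpoint("B.pkl", lambda: correct_but_expensive_function(A))
# C = sketchy_under_development_function(B)
```

The point is that a crash in the second step costs nothing on restart, because the expensive result is already on disk.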
There are many tools out there used to automate scientific workflows by having the user specify some sort of dependency graph between files. These tools are especially popular in bioinformatics, but are broadly applicable in many disciplines. Some popular tools for this include make, luigi, Snakemake, nextflow, and many others. I like these tools a lot, but it can be tedious to break up a lengthy python script into pieces and manually specify the dependency graph. This effort seems especially redundant because dask already builds a computational graph behind the scenes.
It would be nice to manually mark nodes of a given dask graph that should be cached, and have them automatically reloaded from disk or recomputed depending on certain conditions. In theory this could provide a very convenient `make`-like replacement which does not require manually specifying file names and allows one to use python data structures. Moreover, some simple wrappers around `dask.delayed` could be written that allow users to run external command-line utilities on the intermediate outputs.
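As a rough illustration of that command-line idea (everything here is hypothetical, not an existing dask helper), such a wrapper could materialise an intermediate result to a temporary file and run the external tool on it; wrapping the call in `dask.delayed` would then make it an ordinary graph node:

```python
import os
import subprocess
import tempfile

def run_cli_on_intermediate(data: bytes, argv_template):
    """Write `data` to a temp file, run an external command on it,
    and return the tool's stdout. With dask, a call like this could
    be wrapped in dask.delayed to become a node in the graph."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(data)
        path = f.name
    try:
        # Substitute the temp file path into the command template.
        argv = [a.format(input=path) for a in argv_template]
        return subprocess.run(argv, capture_output=True, check=True).stdout
    finally:
        os.remove(path)

# e.g. count words in an intermediate output with `wc`:
# out = run_cli_on_intermediate(b"hello world\n", ["wc", "-w", "{input}"])
```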
Possible syntax
One possibility could look like this:

```python
cache = FileSystemCache(...)
c = cache.saved(a + b)
d = c**10
```
The conditions for reloading or rerunning the `a + b` computation should depend on the type of the `cache` object. Some nice reloading/recomputing conditions could be things like:
- file modification time (like `make`)
- argument memoization
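For illustration, the first condition is just the classic `make` staleness rule: rebuild the cached output if it is missing or older than any of its inputs. A sketch using only modification times (the function name and semantics are my own, not a proposed dask API):

```python
import os

def is_stale(target, sources):
    """make-style rule: the cached `target` must be rebuilt if it is
    missing, or if any of its `sources` is newer than it."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_mtime for src in sources)
```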
Another possible syntax, which is slightly more verbose, could be a dict-like cache object:

```python
cache['c'] = a + b
d = cache['c']**10
```
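A toy version of this dict-like interface could be a mapping backed by pickle files, one per key (the class name and layout are illustrative; a real dask version would store the result of computing the graph node rather than a plain Python value):

```python
import os
import pickle

class FileSystemCache:
    """Toy dict-like cache: each key is stored as a pickle file in
    `directory`, so values survive into a fresh session."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def _path(self, key):
        return os.path.join(self.directory, key + ".pkl")

    def __setitem__(self, key, value):
        with open(self._path(key), "wb") as f:
            pickle.dump(value, f)

    def __getitem__(self, key):
        with open(self._path(key), "rb") as f:
            return pickle.load(f)

    def __contains__(self, key):
        return os.path.exists(self._path(key))

# cache = FileSystemCache("cache_dir")
# cache['c'] = a + b        # with dask, this would compute and persist
# d = cache['c'] ** 10      # reloads from disk, even after a restart
```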
Automatic memoization of functions using some heuristic, like the one Cachey uses, could be useful, but I am personally okay with manually indicating which steps should be saved to disk.
Issue Analytics
- State:
- Created 6 years ago
- Reactions: 11
- Comments: 43 (38 by maintainers)
Top GitHub Comments
Graphchain might be of interest to this thread. It is a caching optimiser for dask graphs. Some of its features:

- It avoids the `joblib.Memory`-style approach of hashing a computation’s inputs (which can be very expensive when those inputs are large numpy arrays or pandas DataFrames).
- It can be used as a dask graph optimiser (e.g., with `dask.config.set(delayed_optimize=graphchain.optimize)`), or with the built-in `get` convenience function (i.e., `graphchain.get(dsk, keys, location='s3://mybucket/__graphchain_cache__')`).

There are two years’ worth of comments here, so I’m not sure if it fits all the use cases described, but if you have any feature requests we’d be happy to take a look!
Sorry for the slow reply. Have had a lot of meetings of late.
The main advantage of a custom callback would be to enable substitution of values in the Dask graph with ones from the cache without engagement from the developer/user.
What concerns come to mind?
Can certainly see the value in that. Luigi does have a nice API. At least for me, I’d like to tackle this without adding Luigi as a dependency and think it should be possible.
So I think this is really where the plugin shines. Namely it separates the concerns of operating on the graph from the graph’s contents. It also avoids editing the graph per se as it can just replace a node in the graph with its contents loaded from disk.
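To make that substitution idea concrete: dask graphs are plain dicts mapping keys to tasks, so a cache-aware pass can swap a cached node’s task for its literal value before scheduling, and the scheduler then never recomputes it. A hand-rolled sketch over a dask-style dict (the helper name is mine; this is not dask’s actual plugin API):

```python
from operator import add, mul

def substitute_cached(dsk, cache):
    """Return a copy of a dask-style graph where any key present in
    `cache` has its task replaced by the cached literal value."""
    return {key: cache.get(key, task) for key, task in dsk.items()}

# A tiny dask-style graph: c = a + b, then d = c * 10
dsk = {"a": 1, "b": 2, "c": (add, "a", "b"), "d": (mul, "c", 10)}

# Pretend 'c' was persisted to disk in an earlier session:
optimized = substitute_cached(dsk, {"c": 3})
# 'c' is now a literal, so the scheduler skips the add entirely.
```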