Filesystem persistence and using dask as a make/snakemake alternative

I brought up this issue/feature request in an issue at the cachey repo, and @mrocklin suggested that I open a new issue here.

It would be useful to add a persistent, filesystem-based caching mechanism to dask, to complement its in-memory caching ability. While the main advantage of in-memory caching is to speed up computations by saving redundant outputs, filesystem-based caching is more about maintaining persistence across sessions and potentially saving interesting intermediate outputs. This is particularly important when developing computational pipelines, because new code frequently contains bugs that cause crashes. It is also useful for maintaining persistence between IPython or Jupyter notebook sessions. For example, suppose I am developing a workflow in a typical exploratory fashion, where A is some dask object:

B = correct_but_expensive_function(A)
C = sketchy_under_development_function(B)
C.to_hdf5("done.h5", "/C")

The correct_but_expensive_function could successfully finish after five minutes of work, but sketchy_under_development_function might use up all the memory on my Linux desktop and force me to kill the process, if I am lucky, or hard-reset the computer, if I am unlucky.

There are many tools for automating scientific workflows by having the user specify a dependency graph between files. These tools are especially popular in bioinformatics but are broadly applicable across many disciplines; popular examples include make, luigi, Snakemake, nextflow, and many others. I like these tools a lot, but it can be tedious to break a lengthy Python script into pieces and specify the dependency graph by hand. This effort seems especially redundant because dask already builds a computational graph behind the scenes.

It would be nice to mark nodes of a given dask graph that should be cached and have them automatically reloaded from disk or recomputed depending on certain conditions. In principle this could provide a very convenient make-like replacement that does not require manually specifying file names and lets one work with Python data structures. Moreover, some simple wrappers around dask.delayed could be written that allow users to run external command-line utilities on the intermediate outputs (a rough sketch follows).
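For instance, a wrapper along these lines could pull a command-line tool into a dask graph. This is only a sketch of the idea, not part of any existing API; the samtools command and file names are purely illustrative:

import subprocess

import dask

@dask.delayed
def run_cli(cmd, input_path, output_path):
    # Run an external tool on an intermediate file; return the output
    # path so downstream tasks can depend on it.
    subprocess.run(cmd + [input_path, "-o", output_path], check=True)
    return output_path

# Hypothetical usage: sort a BAM file as one node of a larger graph.
sorted_bam = run_cli(["samtools", "sort"], "reads.bam", "reads.sorted.bam")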

Possible syntax

Some possible syntax for this could be something like the following:

cache = FileSystemCache(...)
c = cache.saved(a + b)
d = c**10

The conditions for reloading or rerunning the a + b computation should depend on the type of cache object. Some useful reloading/recomputing conditions could be (a sketch of the second approach follows the list):

  1. file modification time (like make)
  2. argument memoization
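As a rough sketch of the memoization approach, assuming nothing beyond dask's public dask.delayed and dask.base.tokenize, a FileSystemCache could key results on a deterministic hash of the computation's graph. None of this is an existing dask API:

import os
import pickle

import dask
from dask.base import tokenize

def _load(path):
    with open(path, "rb") as f:
        return pickle.load(f)

def _store(value, path):
    with open(path, "wb") as f:
        pickle.dump(value, f)
    return value

class FileSystemCache:
    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def saved(self, obj):
        # Key on a hash of the object's graph (argument memoization);
        # tokenize is stable across sessions for graphs of pure tasks.
        path = os.path.join(self.directory, tokenize(obj) + ".pkl")
        if os.path.exists(path):
            # Cache hit: replace the whole subgraph with a load-from-disk task.
            return dask.delayed(_load)(path)
        # Cache miss: compute as usual, persisting the result on the way out.
        # dask.delayed unpacks the collection `obj`, so _store receives its
        # computed value and the task depends on the original subgraph.
        return dask.delayed(_store)(obj, path)

With this, c = cache.saved(a + b) is a Delayed that either reloads the previous result or computes and stores it, and d = c**10 builds on it as usual.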

Another possible syntax, which is slightly more verbose, could be a dict-like cache object:

cache['c'] = a + b
d = cache['c']**10
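Reusing the helpers from the sketch above, a dict-like variant might look like this (again hypothetical; note that here the reload condition is simply whether the file exists, rather than make-style modification times):

class DictCache(FileSystemCache):
    # Dict-style access: keys are user-chosen names rather than graph hashes.

    def __init__(self, directory):
        super().__init__(directory)
        self._items = {}

    def __setitem__(self, key, obj):
        path = os.path.join(self.directory, key + ".pkl")
        if os.path.exists(path):
            self._items[key] = dask.delayed(_load)(path)        # reload from disk
        else:
            self._items[key] = dask.delayed(_store)(obj, path)  # compute and persist

    def __getitem__(self, key):
        return self._items[key]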

Automatic memoization of functions using some heuristic, as cachey does, could be useful, but I am personally fine with manually indicating which steps should be saved to disk.

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Reactions: 11
  • Comments: 43 (38 by maintainers)

Top GitHub Comments

2 reactions
lsorber commented, May 4, 2019

Graphchain might be of interest to this thread. It is a caching optimiser for dask graphs. Some of its features:

  1. Caches dask computations to any PyFilesystem FS URL, including, for example, the OS filesystem, memory, and S3.
  2. Cache keys are based on a chain of hashes (hence the name graphchain), so a cached result can be identified almost immediately. This is different from a joblib.Memory-style approach of hashing a computation's inputs (which can be very expensive when those inputs are large numpy arrays or pandas DataFrames).
  3. A result is only cached if caching is expected to save time compared with simply recomputing the result (which depends on the latency and bandwidth characteristics of the cache's PyFilesystem).

It can be used as a dask graph optimiser (e.g., with dask.config.set(delayed_optimize=graphchain.optimize)), or via the built-in get convenience function (i.e., graphchain.get(dsk, keys, location='s3://mybucket/__graphchain_cache__')). A short sketch of both patterns follows.
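Both usage patterns from the comment, spelled out. The delayed object, the dsk graph, and keys are placeholders to fill in; check the graphchain documentation for the current API:

import dask
import graphchain

# Pattern 1: install graphchain as the optimiser for dask.delayed objects.
with dask.config.set(delayed_optimize=graphchain.optimize):
    result = my_delayed.compute()  # my_delayed: any dask.delayed object

# Pattern 2: the convenience function, with an explicit cache location.
result = graphchain.get(dsk, keys, location="s3://mybucket/__graphchain_cache__")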

There are two years' worth of comments here, so I'm not sure it fits all the use cases described, but if you have any feature requests we'd be happy to take a look!

2 reactions
jakirkham commented, Jan 23, 2018

Sorry for the slow reply. Have had a lot of meetings of late.

> Do you think a custom callback would be better than an array plugin here?

The main advantage of a custom callback would be to enable substitution of values in the Dask graph with ones from the cache without engagement from the developer/user.

> Also, I am concerned about the robustness of inspecting the dask graph based on names.

What concerns come to mind?

> I have lately been thinking about making dask play well with luigi target objects.

Can certainly see the value in that. Luigi does have a nice API. At least for me, I’d like to tackle this without adding Luigi as a dependency and think it should be possible.

> Then, the callback would search the graph for load_if_exists_or_call and replace it with {'b': (reader, tgt)} if the target exists.

So I think this is really where the plugin shines: it separates the concern of operating on the graph from the graph's contents. It also avoids editing the graph per se, as it can just replace a node in the graph with its contents loaded from disk (a sketch of the substitution idea follows).
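To make the substitution concrete, here is a minimal sketch in plain dask graph-dict terms. load_if_exists_or_call, reader, and tgt follow the names used above; none of this is an existing dask or luigi API, beyond luigi targets providing .exists():

def load_if_exists_or_call(tgt, reader, func, *args):
    # Hypothetical task wrapper: load from the target if it exists,
    # otherwise compute from scratch.
    return reader(tgt) if tgt.exists() else func(*args)

def substitute_cached(dsk):
    # Rewrite a dask graph dict, replacing load_if_exists_or_call tasks
    # whose target already exists with a plain load-from-target task.
    out = {}
    for key, task in dsk.items():
        if isinstance(task, tuple) and task and task[0] is load_if_exists_or_call:
            _, tgt, reader, func, *args = task
            if tgt.exists():
                out[key] = (reader, tgt)  # e.g. {'b': (reader, tgt)}
                continue
        out[key] = task
    return out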
