Opportunistic Caching
Currently we clean up intermediate results quickly if they are not necessary for any further pending computation. This is good because it minimizes the memory footprint on the workers, often allowing us to process larger-than-distributed-memory computations.
However, this can be inefficient for interactive workloads: when users submit related computations one after the other, the scheduler has no opportunity to plan ahead, and instead has to recompute an intermediate result that was previously computed and then garbage collected.
We could hold on to some of these results in the hope that the user will request them again. This trades active memory for potential CPU time. Ideally we would hold onto results that (see the scoring sketch after this list):
- Have a small memory footprint
- Take a long time to compute
- Are likely to be requested again (evidenced by recent behavior)
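One illustrative way to combine these criteria into a single retention score (not cachey's exact formula; the function name and weighting here are invented for illustration):

```python
def retention_score(compute_time, nbytes, recent_hits):
    """Rank a result for retention: expensive-to-compute, small, and
    recently reused results score highest.  Illustrative only."""
    seconds_per_byte = compute_time / max(nbytes, 1)
    return seconds_per_byte * (1 + recent_hits)
```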
We did this for the single machine scheduler
- http://dask.pydata.org/en/latest/caching.html
- http://matthewrocklin.com/blog/work/2015/08/03/Caching
- https://github.com/dask/cachey
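For reference, on the single-machine scheduler this is enabled roughly as follows (per the caching docs linked above; the 2 GB budget is just an example):

```python
from dask.cache import Cache

cache = Cache(2e9)   # keep up to ~2 GB of intermediate results
cache.register()     # activate globally for subsequent dask computations
```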
We could do it in the distributed scheduler fairly easily by creating a SchedulerPlugin that watches all computations, selects results to keep based on logic similar to what is currently in cachey, and creates a fake Client to keep an active reference to those keys in the scheduler. A sketch of such a plugin follows.
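This minimal sketch assumes the scheduler's internal client_desires_keys / client_releases_keys methods and that processing-to-memory transitions forward nbytes and startstops to plugins, as recent distributed versions do; the scoring and FIFO eviction are crude stand-ins for cachey's logic.

```python
from distributed.diagnostics.plugin import SchedulerPlugin

FAKE_CLIENT = "opportunistic-cache"  # fictitious client that "wants" the cached keys


class OpportunisticCachePlugin(SchedulerPlugin):
    """Retain small, expensive results by pretending a client still needs them."""

    def __init__(self, scheduler, budget=2e9, min_seconds_per_byte=1e-6):
        self.scheduler = scheduler
        self.budget = budget                      # total bytes we are willing to pin
        self.min_seconds_per_byte = min_seconds_per_byte
        self.held = {}                            # key -> nbytes currently pinned

    def transition(self, key, start, finish, *args, **kwargs):
        if start != "processing" or finish != "memory":
            return
        nbytes = kwargs.get("nbytes", 0)
        startstops = kwargs.get("startstops", ())
        compute_time = sum(
            d["stop"] - d["start"] for d in startstops if d.get("action") == "compute"
        )
        if not nbytes or not compute_time:
            return
        if compute_time / nbytes < self.min_seconds_per_byte:
            return                                # cheap relative to its size; skip
        # Pin the key: the scheduler will not release data that a client desires.
        self.scheduler.client_desires_keys(keys=[key], client=FAKE_CLIENT)
        self.held[key] = nbytes
        self._evict()

    def _evict(self):
        # Crude FIFO eviction to stay within the byte budget.
        while self.held and sum(self.held.values()) > self.budget:
            victim = next(iter(self.held))
            del self.held[victim]
            self.scheduler.client_releases_keys(keys=[victim], client=FAKE_CLIENT)
```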
Top GitHub Comments
Scheduler plugins are at https://distributed.dask.org/en/latest/plugins.html and the Scheduler API is at https://distributed.dask.org/en/latest/scheduling-state.html#distributed.scheduler.Scheduler
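Registering such a plugin could be done from a scheduler preload script, for example (a sketch; cache_plugin.py is a hypothetical module containing the class above):

```python
# cache_preload.py -- start the scheduler with: dask-scheduler --preload cache_preload.py
from cache_plugin import OpportunisticCachePlugin  # hypothetical module with the sketch above


def dask_setup(scheduler):
    scheduler.add_plugin(OpportunisticCachePlugin(scheduler))
```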
To be explicit, the mechanism to keep data on the cluster might look like this:
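The snippet that followed is not included in this excerpt; a guess at the shape of such a mechanism, reusing the scheduler methods mentioned above (keys_to_cache and keys_to_drop are placeholder variables):

```python
# Pin keys on the cluster by registering interest from a made-up client name;
# release them later so normal cleanup can reclaim the memory.
scheduler.client_desires_keys(keys=keys_to_cache, client="fake-caching-client")
scheduler.client_releases_keys(keys=keys_to_drop, client="fake-caching-client")
```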