[Question] A new approach to memory spilling
Question: Would the Dask/Distributed community be interested in an improved memory spilling model that fixes the shortcomings of the current one but makes use of proxy object wrappers?
In Dask-CUDA we have introduced a new approach to memory spilling that handles object aliasing and JIT memory un-spilling: https://github.com/rapidsai/dask-cuda/pull/451
The result is memory spilling that:
- Avoids double counting: https://github.com/dask/distributed/issues/4186
- Avoids spilling the same object multiple times.
- Avoids memory spikes caused by an incorrect memory tally.
- Implements just-in-time un-spilling: https://github.com/dask/distributed/pull/3998
- Supports communication of spilled data, so that GPU data doesn't have to be un-spilled just to be spilled again as part of communication: https://github.com/rapidsai/dask-cuda/issues/342
- Is a first step toward partially spilled objects, such as spilling individual columns of a data frame: https://github.com/BlazingDB/blazingsql/issues/1128
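To make the mechanism behind these points concrete, here is a minimal sketch of the core idea. It is an illustration only, not Dask-CUDA's actual ProxyObject implementation, and it assumes pickle-based serialization where the real code handles CUDA device memory and Dask's serialization protocols: a proxy wraps an object, can serialize ("spill") it on demand, and transparently deserializes ("un-spills") it the first time the wrapped object is used.

import pickle

import numpy as np


class ProxyObject:
    """Wrap an object so it can be spilled and un-spilled on demand."""

    def __init__(self, obj):
        self._obj = obj        # the in-memory object, or None while spilled
        self._spilled = None   # the serialized bytes while spilled

    def spill(self):
        """Serialize the wrapped object and drop the in-memory reference."""
        if self._obj is not None:
            self._spilled = pickle.dumps(self._obj)
            self._obj = None

    def unspill(self):
        """Deserialize on first use -- the "just-in-time" part."""
        if self._obj is None:
            self._obj = pickle.loads(self._spilled)
            self._spilled = None
        return self._obj

    def __getattr__(self, name):
        # Any attribute access on the proxy un-spills the wrapped object.
        return getattr(self.unspill(), name)


p = ProxyObject(np.arange(3))
p.spill()         # the array now exists only in serialized form
print(p.sum())    # attribute access un-spills just in time and prints 3

Since tasks receive the proxy rather than the raw object, the data can stay spilled until something actually touches it.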
The current implementation in Dask-CUDA handles CUDA device objects, but it is possible to generalize the approach to also handle spilling to disk.
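That disk generalization could look roughly like the hypothetical sketch below, where DiskProxy and SpillManager are illustrative names rather than any existing Dask-CUDA API: a manager tracks each proxy exactly once and, when a memory threshold is exceeded, spills the oldest in-memory ones to temporary files. Tracking each object exactly once is what rules out double counting and repeated spilling.

import os
import pickle
import tempfile


class DiskProxy:
    """Like the sketch above, but spills to a temporary file on disk."""

    def __init__(self, obj, nbytes):
        self.obj = obj        # None while the object lives on disk
        self.nbytes = nbytes  # size used for the memory tally
        self.path = None

    def spill(self):
        if self.obj is not None:
            fd, self.path = tempfile.mkstemp(suffix=".spill")
            with os.fdopen(fd, "wb") as f:
                pickle.dump(self.obj, f)
            self.obj = None

    def unspill(self):
        if self.obj is None:
            with open(self.path, "rb") as f:
                self.obj = pickle.load(f)
            os.remove(self.path)
            self.path = None
        return self.obj


class SpillManager:
    """Track every proxy once and spill oldest-first past a memory limit."""

    def __init__(self, memory_limit):
        self.memory_limit = memory_limit
        self.proxies = []  # insertion order doubles as a crude LRU

    def add(self, obj, nbytes):
        proxy = DiskProxy(obj, nbytes)
        self.proxies.append(proxy)
        self._maybe_spill()
        return proxy

    def _in_memory(self):
        return sum(p.nbytes for p in self.proxies if p.obj is not None)

    def _maybe_spill(self):
        for proxy in self.proxies:
            if self._in_memory() <= self.memory_limit:
                break
            proxy.spill()  # already-spilled proxies are a cheap no-op


mgr = SpillManager(memory_limit=100)
a = mgr.add(list(range(50)), nbytes=80)
b = mgr.add(list(range(50)), nbytes=80)  # total 160 > 100, so `a` is spilled
assert a.obj is None and b.obj is not None
print(a.unspill()[:3])                   # [0, 1, 2], read back from disk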
The disadvantage of this approach is that proxy objects get exposed to users. The inputs to a task might be wrapped in a proxy object, which doesn't mimic the proxied object perfectly. E.g.:
# Type checking using isinstance() works as expected, but direct type checking doesn't:
>>> import numpy as np
>>> from dask_cuda.proxy_object import asproxy
>>> x = np.arange(3)
>>> isinstance(asproxy(x), type(x))
True
>>> type(asproxy(x)) is type(x)
False
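For intuition on the behavior above: one way a proxy can make isinstance() succeed while direct type checking still fails is to override __class__ with a property, a trick used by several wrapper libraries. The sketch below shows the general technique and may differ from how ProxyObject actually implements it.

import numpy as np


class Proxy:
    def __init__(self, obj):
        self._obj = obj

    @property
    def __class__(self):
        # isinstance() consults __class__, so the proxy reports the
        # proxied object's type here...
        return type(self._obj)


p = Proxy(np.arange(3))
assert isinstance(p, np.ndarray)   # True: __class__ says ndarray
assert type(p) is not np.ndarray   # ...but type() ignores __class__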
Because of this, the approach shouldn't be enabled by default. But do you think the Dask community would be interested in a generalization of this approach, or is the proxy object hurdle too much of an issue?
cc. @mrocklin, @jrbourbeau, @quasiben
Top GitHub Comments
This may also provide other benefits, like allowing for async reading/writing, worker scheduling that is sensitive to which data is already in fast memory, and pre-fetching data from slow memory.
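For illustration, the pre-fetching idea could be as simple as the hypothetical sketch below, which assumes proxies exposing an unspill() method like the sketches earlier in the thread; prefetch and its thread pool are not an existing API.

from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=4)


def prefetch(proxies):
    """Start un-spilling each proxy in the background, without blocking."""
    return [_pool.submit(proxy.unspill) for proxy in proxies]

# A worker about to run a task could call prefetch() on the task's inputs,
# then wait on the returned futures only when the data is actually needed.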
If this is very effective for shuffle workloads, then maybe it's something that we could implement just for that code path? That might be a tightly scoped place to try this out before applying it more broadly.