RFC: explicit shared memory
With the increasing availability of large machines, more workloads are being run as many processes on a single node. In a workflow where a single large array would be passed to workers, this is currently done by passing the array from the client (bad), using scatter (OK), or loading the data in the workers (good, but not efficient if we want one big array).
A large memory and transfer cost could be saved by putting the array into POSIX shared memory and referencing it from the workers. If the array lives in shm, there is no copy or de/ser cost (though there is an OS call cost to attach to the shm). It could be appropriate for ML workflows where every task wants to make use of the whole of a large dataset (as opposed to chunking the dataset as dask.array operations do). sklearn with joblib is an example where we explicitly recommend scattering large data.
As a really simple example, see my gist, in which the user has to explicitly wrap a numpy array in the client, and then dask workers no longer need to have their own copies. Note that SharedArray is just a simple way to pass the array metadata as well as its buffer; it works for py37 and probably earlier.
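A minimal sketch of the pattern (not the gist itself, which is not reproduced here), assuming the third-party SharedArray package is installed on the client and the workers, and that all workers run on the same node (e.g. a LocalCluster); the array and function names are illustrative:

```python
import numpy as np
import SharedArray as sa
from dask.distributed import Client

client = Client()  # assumes a LocalCluster of worker processes on this node

# Client side: create the array in POSIX shared memory and fill it once.
# Only the shm name (a small string) needs to reach the workers, not the data.
data = sa.create("shm://big_array", (100_000, 128), dtype="float64")
data[:] = np.random.random(data.shape)

def task(i, shm_name):
    # Worker side: attach to the existing shared-memory segment.
    # This is an OS call, not a copy or deserialisation of the data.
    arr = sa.attach(shm_name)
    return float(arr[i].sum())

futures = client.map(task, range(10), shm_name="shm://big_array")
print(client.gather(futures))

sa.delete("big_array")  # remove the shared-memory segment when finished
```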
To be clear: there is no suggestion of adding anything to the existing distributed serialisation code, because it’s really hard to try to guess when a user might want to use such a thing. It should be explicitly opt-in.
Further,
- Similar techniques could be used to wrap arrow or pandas data, although probably no one wants to delve through existing in-memory objects to find the underlying buffers.
- Pickle V5 works on buffers and memoryviews, so it might be a generic helper here (see the sketch after this list).
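For reference, a minimal illustration of pickle protocol 5 out-of-band buffers (standard library only, not a proposed distributed API): large contiguous buffers are handed out during pickling and reattached on load, so they could live in, or already be, shared memory rather than travelling through the pickle stream.

```python
import pickle
import numpy as np

arr = np.arange(1_000_000, dtype="float64")

# Pickle with protocol 5: large buffers go to buffer_callback
# instead of being embedded in the byte stream.
buffers = []
payload = pickle.dumps(arr, protocol=5, buffer_callback=buffers.append)

# `payload` now holds only metadata; the data sits in `buffers`
# as PickleBuffer objects that can be transported out of band.
restored = pickle.loads(payload, buffers=buffers)
assert (restored == arr).all()
```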
Top GitHub Comments
This has been a recurring need in my projects, where the entirety of the data needs to be accessible to all workers, yet duplicating the data for each worker would exceed available system memory. I have primarily used NumPy arrays and pandas DataFrames backed by shared memory thus far – it would be cool to use dask as part of this. My situation might only qualify as a single data point, but I am not the only weirdo with this sort of need.
+1 on this as well.
I think this already works with multiprocessing.shared_memory.SharedMemory – see the second code example in the docs, where a NumPy array is backed by shared memory: https://docs.python.org/3/library/multiprocessing.shared_memory.html

The implementation behind multiprocessing.shared_memory is POSIX shared memory on all systems where that's available and Named Shared Memory on Windows. This makes for a cross-platform API for shared memory that's tested and viable on quite a variety of different OSes and platform types.

A more general tool to wrap pandas and pandas-like objects was developed prior to the release of Python 3.8 (and the shared memory constructs in the Python Standard Library), but it was not suitable for inclusion in the core because it was not general-purpose enough, even if it was useful.
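For context, a condensed version of the pattern from that documentation page (segment names are assigned by the OS; variable names here are illustrative). One process creates the block, any other process on the same machine attaches by name:

```python
import numpy as np
from multiprocessing import shared_memory

# Process A: create a shared-memory block and view it as a NumPy array.
a = np.array([1, 1, 2, 3, 5, 8], dtype=np.int64)
shm = shared_memory.SharedMemory(create=True, size=a.nbytes)
b = np.ndarray(a.shape, dtype=a.dtype, buffer=shm.buf)
b[:] = a[:]          # copy the data into shared memory once
print(shm.name)      # e.g. 'psm_21467_46075'; pass this name to other processes

# Process B (another interpreter on the same machine): attach by name.
existing = shared_memory.SharedMemory(name=shm.name)
c = np.ndarray(a.shape, dtype=a.dtype, buffer=existing.buf)
print(c)             # [1 1 2 3 5 8], with no copy or deserialisation

# Clean up: close in every process, unlink exactly once.
existing.close()
shm.close()
shm.unlink()
```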