[dask-on-ray] Memory efficiency: Dask arrays should always be passed by reference only
When passing a (persisted) dask array with np.ndarray chunks to a Ray remote function, the whole dask array gets serialized. I created a notebook showing the consequence: https://gist.github.com/Hoeze/26430689bb3252ebb71b9ae08242d09a
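A minimal sketch of the pattern being described, assuming a local Ray cluster and the Dask-on-Ray scheduler; the array shape and chunk sizes are illustrative:

```python
import dask
import dask.array as da
import ray
from ray.util.dask import ray_dask_get

ray.init()
dask.config.set(scheduler=ray_dask_get)

# Persist a chunked array: after persist(), the dask graph holds
# fully materialized np.ndarray chunks.
x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000)).persist()

@ray.remote
def total(arr):
    return arr.sum().compute()

# The whole dask array, including every np.ndarray chunk, is serialized
# into the task's argument instead of being passed as object-store references.
print(ray.get(total.remote(x)))
```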
Instead, the np.ndarray chunks should be put into the object store and the dask array should only contain references to them.
This could be solved in different ways, e.g.:
- add some `ray_put_dask` function that converts every chunk into a delayed `ray.get()` on an object-store reference (a sketch follows this list)
- change `dask.persist()` to return chunk references instead of np.ndarray-typed chunks
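A rough sketch of what such a helper might look like; `ray_put_dask` is only the hypothetical name used in this issue, not an existing Ray API, and the implementation below assumes a 2-D array for readability:

```python
import dask
import dask.array as da
import ray


def ray_put_dask(arr: da.Array) -> da.Array:
    # Hypothetical helper: move every chunk into the Ray object store and
    # rebuild the dask array from delayed ray.get() calls on the references.
    delayed_blocks = arr.to_delayed()  # object ndarray of per-block Delayeds
    rebuilt_rows = []
    for i, row in enumerate(delayed_blocks):
        rebuilt_row = []
        for j, block in enumerate(row):
            # Materializing the block is cheap if the array was persisted;
            # ray.put() stores it once and returns a small ObjectRef.
            ref = ray.put(block.compute())
            block_shape = (arr.chunks[0][i], arr.chunks[1][j])
            rebuilt_row.append(
                da.from_delayed(
                    dask.delayed(ray.get)(ref),
                    shape=block_shape,
                    dtype=arr.dtype,
                )
            )
        rebuilt_rows.append(rebuilt_row)
    return da.block(rebuilt_rows)
```

Passing the rebuilt array to a remote function would then serialize only the small ObjectRefs in its graph; the np.ndarray data stays in the object store and is fetched lazily via `ray.get()` when a chunk is actually computed.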
The core issue is that `ray_dask_get` won't be able to tell whether it's being called by `dask.compute()` or `dask.persist()`; the call will look exactly the same to `ray_dask_get` in either case, which is why a user-supplied `ray_persist=True` flag would be necessary. And I agree that an explicit function would be better! I think that's the best option for now.

We have an internal API to run a function on every worker. Maybe we should just support `ray.util.setup_dask_on_ray()`, which sets the config on all workers and applies our custom shuffle optimization.