[dask-on-ray] Memory efficiency: Dask arrays should always be passed by reference only
When passing a (persisted) dask array with np.ndarray chunks to a Ray remote function, the whole dask array gets serialized. I created a notebook showing the consequence: https://gist.github.com/Hoeze/26430689bb3252ebb71b9ae08242d09a
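A minimal sketch of the pattern being described, assuming a local Ray cluster and the Dask-on-Ray scheduler; the array shape and chunk sizes are illustrative:

```python
import dask
import dask.array as da
import ray
from ray.util.dask import ray_dask_get

ray.init()
dask.config.set(scheduler=ray_dask_get)

# Persist a chunked array: after persist(), the dask graph holds
# fully materialized np.ndarray chunks.
x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000)).persist()

@ray.remote
def total(arr):
    return arr.sum().compute()

# The whole dask array, including every np.ndarray chunk, is serialized
# into the task's argument instead of being passed as object-store references.
print(ray.get(total.remote(x)))
```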
Instead, the np.ndarray chunks should be put into the object store and the dask array should only contain references to them.
This could be solved in different ways, e.g.:
- add some `ray_put_dask` function that converts every chunk into a delayed `ray.get()` on an object-store reference (a sketch follows this list)
- change `dask.persist()` to return chunk references instead of np.ndarray-typed chunks
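A rough sketch of what such a helper might look like; `ray_put_dask` is only the hypothetical name used in this issue, not an existing Ray API, and the implementation below assumes a 2-D array for readability:

```python
import dask
import dask.array as da
import ray


def ray_put_dask(arr: da.Array) -> da.Array:
    # Hypothetical helper: move every chunk into the Ray object store and
    # rebuild the dask array from delayed ray.get() calls on the references.
    delayed_blocks = arr.to_delayed()  # object ndarray of per-block Delayeds
    rebuilt_rows = []
    for i, row in enumerate(delayed_blocks):
        rebuilt_row = []
        for j, block in enumerate(row):
            # Materializing the block is cheap if the array was persisted;
            # ray.put() stores it once and returns a small ObjectRef.
            ref = ray.put(block.compute())
            block_shape = (arr.chunks[0][i], arr.chunks[1][j])
            rebuilt_row.append(
                da.from_delayed(
                    dask.delayed(ray.get)(ref),
                    shape=block_shape,
                    dtype=arr.dtype,
                )
            )
        rebuilt_rows.append(rebuilt_row)
    return da.block(rebuilt_rows)
```

Passing the rebuilt array to a remote function would then serialize only the small ObjectRefs in its graph; the np.ndarray data stays in the object store and is fetched lazily via `ray.get()` when a chunk is actually computed.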
The core issue is that `ray_dask_get` won't be able to tell whether it's being called by `dask.compute()` or `dask.persist()`; the call will look exactly the same to `ray_dask_get` in either case, which is why a user-supplied `ray_persist=True` flag would be necessary. And I agree that an explicit function would be better! I think that's the best option for now.

We have an internal API to run a function on every worker. Maybe we should just support `ray.util.setup_dask_on_ray()`, which sets the config on all workers and applies our custom shuffle optimization.