
[dask-on-ray] Memory efficiency: Dask arrays should always be passed by reference only


When passing a (persisted) dask array with np.ndarray chunks to some ray remote function, the whole dask array does get serialized. I created a notebook showing the consequence: https://gist.github.com/Hoeze/26430689bb3252ebb71b9ae08242d09a

Instead, the np.ndarray chunks should be put to the object store and the dask array should only contain references to the object store.
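The cost being described can be illustrated without Ray or NumPy at all. Below is a minimal sketch using plain Python objects and pickle, where a bytes buffer stands in for an np.ndarray chunk and a short string stands in for an object-store reference (both stand-ins are hypothetical, not the real APIs):

```python
# Illustration of the reported behavior: a container holding raw chunks
# serializes the full chunk payload, while a container holding only
# small references serializes almost nothing.
import pickle

chunk = bytes(1_000_000)                        # stands in for one np.ndarray chunk
array_with_chunks = {"chunks": [chunk] * 4}     # chunk data embedded in the array
array_with_refs = {"chunks": ["objref-0"] * 4}  # references only

embedded = len(pickle.dumps(array_with_chunks))
by_ref = len(pickle.dumps(array_with_refs))

print(embedded > 1_000_000)  # True: the chunk payload is copied on serialization
print(by_ref < 1_000)        # True: only tiny references are copied
```

With real Ray the same asymmetry applies: an ObjectRef is a small handle, while an embedded ndarray is serialized in full every time the enclosing structure crosses a task boundary.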

This could be solved in different ways, e.g.

  • some ray_put_dask function that converts every chunk into a delayed ray.get() call
  • changing dask.persist() to return chunk references instead of np.ndarray-typed chunks
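The first option above can be sketched as follows. This is stdlib-only, with a plain dict standing in for Ray's object store (ray.put / ray.get); the function name ray_put_dask and the fake store are hypothetical, not part of Ray or Dask:

```python
# Sketch of the proposed ray_put_dask idea: each materialized chunk is
# serialized into the store exactly once, and the array afterwards holds
# only small references that tasks resolve on demand.
import pickle

STORE = {}

def fake_put(obj):
    """Serialize the object once into the store; return a small reference."""
    ref = f"objref-{len(STORE)}"
    STORE[ref] = pickle.dumps(obj)
    return ref

def fake_get(ref):
    """Fetch and deserialize the object behind a reference."""
    return pickle.loads(STORE[ref])

def ray_put_dask(chunks):
    """Replace every materialized chunk with a store reference."""
    return [fake_put(c) for c in chunks]

chunks = [bytes(100_000) for _ in range(3)]  # stand-ins for np.ndarray chunks
refs = ray_put_dask(chunks)

# A "remote task" now receives only a reference and fetches the data
# itself, so passing the array around copies bytes of refs, not megabytes.
def task(ref):
    return len(fake_get(ref))

sizes = [task(r) for r in refs]
print(sizes)  # [100000, 100000, 100000]
```

In actual Dask-on-Ray terms, the equivalent would be mapping ray.put over the persisted chunks and rebuilding the array graph around the resulting ObjectRefs.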

@rkooo567 @clarkzinzow

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 12 (9 by maintainers)

Top GitHub Comments

2 reactions
clarkzinzow commented, Feb 23, 2021

The core issue is that ray_dask_get won’t be able to tell whether it’s being called by dask.compute() or dask.persist(); the call will look exactly the same to ray_dask_get in either case, which is why a user-supplied ray_persist=True flag would be necessary. And I agree that an explicit function would be better! I think that’s the best option for now.
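The point about the two calls being indistinguishable can be shown with a stdlib sketch. The scheduler below is a stand-in for ray_dask_get, not the real implementation; only the call shape matters:

```python
# From the scheduler's side, compute() and persist() both reduce to one
# call of the form get(dsk, keys, **kwargs), so the caller's intent can
# only reach the scheduler through an explicit keyword argument.
seen_calls = []

def fake_ray_dask_get(dsk, keys, ray_persist=False, **kwargs):
    """Record what the scheduler can actually observe, then evaluate."""
    seen_calls.append((sorted(dsk), list(keys), ray_persist))
    return [dsk[k] for k in keys]

graph = {"x": 1, "y": 2}  # a trivial stand-in for a dask task graph

# "compute"-style invocation
fake_ray_dask_get(graph, ["x", "y"])
# "persist"-style invocation: same graph, same keys -- indistinguishable
# unless the user passes the flag explicitly
fake_ray_dask_get(graph, ["x", "y"], ray_persist=True)

print(seen_calls[0][:2] == seen_calls[1][:2])  # True: identical call shape
print(seen_calls[1][2])                        # True: only the flag differs
```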

1 reaction
rkooo567 commented, Feb 28, 2021

We have an internal API to run a function on every worker. Maybe we should just support ray.util.setup_dask_on_ray(), which sets the config on all workers and applies our custom shuffle optimization.


