Smart automatic dask wrapping
Is your feature request related to a problem? Please describe.
When creating a dask array from a SparseArray without explicit chunks, dask automatically chunks the data based on its size and its config (dask's array.chunk-size). The point of sparse arrays is that they can be enormous while still holding only a few values, but dask doesn't see that. As a result, many chunks are created from a single small array, which multiplies the number of tasks in dask's graph and thus hurts performance.
Of course, when working directly with sparse and dask, one can pass the chunks of the requested dask array explicitly, but that is not possible when the wrapping happens under the hood.
My use case is an xarray.DataArray wrapping a sparse.COO array that is passed to an xarray.apply_ufunc call together with a dask-backed DataArray. In that case, xarray sends all inputs to dask.array.apply_gufunc and the wrapping into dask happens in dask.array.core.asarray. Our only option today is to pre-wrap the sparse array into a dask array before the computation. I think it would be useful if this happened implicitly.
Describe the solution you’d like
The cleanest option I see is to implement SparseArray.to_dask_array, which dask detects and uses automatically. There we could wrap into a dask array, taking into account that the real size of the array is determined by .nnz rather than .shape. Optionally, we could read dask's config to respect array.chunk-size.
Describe alternatives you've considered
Alternatives are:
- Handling this in our function explicitly.
- Handling this in xarray.
- Handling this in dask (we might be able to cover scipy sparse arrays as well?).
But I felt that this was the best place for it.
Additional context
Raised by issue pangeo-data/xesmf#127.
Example

```python
import sparse as sp
import dask.array as da

A = sp.COO([[0, 5000, 10000], [0, 5000, 10000]], [1, 2, 3])

da.from_array(A)
# dask.array<array, shape=(10001, 10001), dtype=int64, chunksize=(4096, 4096), chunktype=sparse.COO>

da.from_array(A, chunks={})
# dask.array<array, shape=(10001, 10001), dtype=int64, chunksize=(10001, 10001), chunktype=sparse.COO>
```
I recommend raising upstream. Pinging me directly is no longer a reliable way to report issues to Dask.
On Thu, Nov 11, 2021, 1:34 PM Matthew Rocklin wrote:
Probably somewhere in Dask we look at x.nbytes when instead we should call nbytes(x). A fix upstream for that would be welcome.