Allowing setitem-like operation on dask array

Even though stack and concatenate are nice for combining arrays, sometimes they don’t fit the data I have or require a significant amount of work to use, for instance when combining blocks of different data. In cases like these, it would be nice to be able to use array assignment. While it is true that dask creates graphs of pure operations (with few exceptions) and assignment is impure, one could imagine an array-like object that translates assignments into slicing and stacking/concatenating. This would let a user write __setitem__-like syntax while actually creating a new dask array (or potentially modifying the graph of the existing one), so the net result behaves like assignment while remaining pure.
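As a rough illustration of the idea above, here is a minimal sketch of turning a 1-D slice assignment into slicing plus concatenation. The helper name set_block_1d and its xp parameter are hypothetical, not part of dask or NumPy. The example runs on NumPy; since dask.array also provides concatenate, passing it as xp should follow the same pattern, though that is not exercised here.

```python
import numpy as np


def set_block_1d(x, start, stop, value, xp=np):
    """Return a new array equal to x with x[start:stop] replaced by value.

    Hypothetical helper: x is never modified. The caller is assumed to
    pass a value whose length equals stop - start.
    """
    # Keep everything before and after the target slice, and splice the
    # new block in between.
    pieces = [x[:start], value, x[stop:]]
    return xp.concatenate(pieces)


x = np.arange(6)
patch = np.array([-1, -2])
y = set_block_1d(x, 2, 4, patch)  # behaves like y = x.copy(); y[2:4] = patch
```

The original x is left untouched, which is what keeps the operation pure.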

Issue Analytics

State:
Created 7 years ago
Reactions: 1
Comments: 21 (18 by maintainers)

Top GitHub Comments

2 reactions

davidhassell commented, Apr 13, 2021

Thanks, @jakirkham.

This is largely supported after PR (#7033)

Just for the record, in case it’s useful to those who come across this issue in the future, the PR that finally supported this turned out to be #7393, after flaws in #7033 were uncovered.

2 reactions

shoyer commented, Jul 2, 2020

There are at least two parts to this issue:

(1) Modifying dask arrays in-place
(2) “Scatter” type operations that perform the NumPy equivalent of z = x.copy(); z[i] = y; return z

Part (1) is arguably the most problematic for dask, because array properties like chunks are expected to be immutable.

Part (2) is the functionality we really need, regardless of how it’s spelled. JAX uses the notation z = x.at[i].set(y).

I believe it could be significantly easier to implement (2) in dask without the baggage of mutable __setitem__ syntax, e.g., so we can feel free to change chunk sizes as appropriate.
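To make the “scatter” semantics concrete, here is a minimal NumPy sketch of the z = x.copy(); z[i] = y; return z pattern quoted above. The function name scatter_set is hypothetical and is not a dask, NumPy, or JAX API; it only illustrates the out-of-place behavior being asked for.

```python
import numpy as np


def scatter_set(x, index, values):
    """Return a copy of x with x[index] replaced by values (hypothetical helper)."""
    z = x.copy()       # the input array is never mutated
    z[index] = values  # the ordinary in-place assignment happens on the copy
    return z


x = np.zeros(5)
z = scatter_set(x, [1, 3], [10.0, 30.0])
```

Because the caller gets back a new array, nothing about x (including, in dask's case, its chunks) has to change.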

It is of course always possible to translate __setitem__ into __getitem__ in user code, but there are a number of cases where this syntax is much more natural. Notable examples include:

“Unstacking” a pandas.MultiIndex into multiple dimensions, like what xarray does for NumPy arrays from pandas in https://github.com/pydata/xarray/pull/4184
Implementing the gradient of indexing in reverse mode autodiff (OK, technically this needs the equivalent of x[i] += y handling repeated indices i, which is implemented in NumPy as np.add.at).
Overriding boundaries of arrays, as noted by @dionhaefner above in https://github.com/dask/dask/issues/2000#issuecomment-359271743
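For the repeated-indices point in the autodiff example, a small NumPy sketch may help show why plain x[i] += y is not enough and np.add.at is needed. The array names here are illustrative only.

```python
import numpy as np

i = np.array([0, 0, 2])        # index 0 appears twice
y = np.array([1.0, 2.0, 3.0])

buffered = np.zeros(3)
buffered[i] += y               # buffered: only one write per index survives

accumulated = np.zeros(3)
np.add.at(accumulated, i, y)   # unbuffered: every contribution is added
```

With buffered assignment the two contributions to index 0 do not add up, whereas np.add.at accumulates both, which is the behavior a gradient of indexing requires.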
