Parallel map/apply powered by dask.array
Dask is awesome, but it isn't always easy to use for parallel operations. In many cases, especially when wrapping routines from external libraries, it is most straightforward to express operations in terms of a function that expects and returns xray objects loaded into memory.
Dask array has a `map_blocks` function/method, but its applicability is limited because dask.array doesn't have axis names for unambiguously identifying dimensions. `da.atop` can handle many of these cases, but it's not the easiest to use. Fortunately, we have sufficient metadata in xray that we could probably parallelize many `atop` operations automatically, by inferring result dimensions and dtypes from applying the function once. See here for more discussion on the dask side: https://github.com/blaze/dask/issues/702
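As a rough illustration (not anything in xray itself) of the "apply the function once" idea: one can run the user's function on a tiny sample block to discover the output dtype, then hand the parallel work to `dask.array.map_blocks`. The function name `user_func` and the chunk sizes below are made up for the example.

```python
import numpy as np
import dask.array as da

def user_func(block):
    # stand-in for an arbitrary shape-preserving NumPy routine
    return np.sqrt(block) + 1.0

data = da.from_array(np.random.rand(1000, 1000), chunks=(250, 250))

# apply the function once to a tiny sample to learn the output dtype
sample = user_func(np.zeros((1, 1), dtype=data.dtype))

# dask does the parallel work; note it only knows positional axes, not names
result = data.map_blocks(user_func, dtype=sample.dtype)
result.compute()
```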
So I would like to add some convenience methods for automatic parallelization with dask of a function defined on xray objects loaded into memory. In addition to a `map_blocks` method/function, it would be useful to add some sort of `parallel_apply` method to groupby objects that works very similarly, by lazily applying a function that takes and returns xray objects loaded into memory.
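To make the proposal concrete, here is a minimal sketch, not actual xray API, of what such a groupby `parallel_apply` could do under the hood: wrap the user's in-memory function in `dask.delayed` for each group, compute the tasks in parallel, and concatenate the results. The helper name and the `monthly_anomaly` function in the usage line are hypothetical.

```python
import dask
import xarray as xr

def parallel_apply(grouped, func, concat_dim):
    """Lazily apply ``func`` to every group, then reassemble the results."""
    # one delayed task per group; nothing is computed yet
    tasks = [dask.delayed(func)(group) for _, group in grouped]
    # run the tasks in parallel, then stitch the pieces back together
    results = dask.compute(*tasks)
    return xr.concat(results, dim=concat_dim)

# hypothetical usage:
# anomalies = parallel_apply(ds.groupby('time.month'), monthly_anomaly, 'time')
```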
Top GitHub Comments
I have a preliminary implementation up in https://github.com/pydata/xarray/pull/1517
I think #964 provides a viable path forward here.
Previously, I was imagining the user provides a function that maps `xarray.DataArray` -> `xarray.DataArray`. Such functions are tricky to parallelize with dask.array because we need to run them to figure out the result dimensions/coordinates.

In contrast, with a user-defined function `ndarray` -> `ndarray`, it's fairly straightforward to parallelize these with dask array (e.g., using `dask.array.elemwise` or `dask.array.map_blocks`). Then we could add the metadata back in afterwards with #964.

In principle, we could do this automatically – especially if dask had a way to parallelize arbitrary NumPy generalized universal functions. Then the user could write something like
`xarray.apply(func, data, signature=signature, dask_array='auto')`
to automatically parallelize func over their data. In fact, I had this in some previous commits for #964, but took it out for now, just to reduce scope for the change.
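For reference, here is roughly what the `ndarray` -> `ndarray` route looks like when wired up by hand with the dask and xarray pieces that exist today; the `demean` function, array shape, and dimension names are invented for the example, and the last line is the metadata re-attachment step that #964 / `xarray.apply` would automate.

```python
import xarray as xr
import dask.array as da

def demean(block):
    # stand-in for any shape-preserving NumPy routine; removes each block's mean
    return block - block.mean()

arr = xr.DataArray(
    da.random.random((360, 180, 120), chunks=(360, 180, 12)),
    dims=('lon', 'lat', 'time'),
)

# parallelize the plain ndarray -> ndarray function over the dask blocks
parallel_data = arr.data.map_blocks(demean, dtype=arr.dtype)

# put the xarray metadata back on by hand: the step that would be automated
result = xr.DataArray(parallel_data, dims=arr.dims, coords=arr.coords)
```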