Parallel map/apply powered by dask.array

Dask is awesome, but it isn’t always easy to use it for parallel operations. In many cases, especially when wrapping routines from external libraries, it is most straightforward to express operations in terms of a function that expects and returns xray objects loaded into memory.

Dask array has a map_blocks function/method, but its applicability is limited because dask.array doesn’t have axis names for unambiguously identifying dimensions. da.atop can handle many of these cases, but it’s not the easiest to use. Fortunately, we have sufficient metadata in xray that we could probably parallelize many atop operations automatically by inferring result dimensions and dtypes from applying the function once. See here for more discussion on the dask side: https://github.com/blaze/dask/issues/702
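To make the limitation concrete, here is a minimal sketch using only dask.array. The array shape, chunking, and the helper name row_sum are made up for illustration. Note that the dimension to remove has to be given as a positional axis number, since dask.array carries no dimension names.

```python
import numpy as np
import dask.array as da

# Illustrative 4 x 6 array split into two row blocks; axis 1 sits in one chunk.
x = da.ones((4, 6), chunks=(2, 6))

def row_sum(block):
    # Hypothetical helper: receives a plain NumPy block and removes axis 1.
    return block.sum(axis=1)

# drop_axis is positional: dask has no way to say "drop the dimension named x".
row_totals = da.map_blocks(row_sum, x, drop_axis=1, dtype=x.dtype)
result = row_totals.compute()
```

The caller has to track which integer axis means what, and has to state the dtype and dropped axes by hand. That is the bookkeeping the labeled metadata could take over.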

So I would like to add some convenience methods for automatic parallelization with dask of a function defined on xray objects loaded into memory. In addition to a map_blocks method/function, it would be useful to add some sort of parallel_apply method to groupby objects that works very similarly, by lazily applying a function that takes and returns xray objects loaded into memory.
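A rough sketch of what such a convenience function could look like from the user's side, assuming a recent xarray release that ships xarray.map_blocks and has dask installed. The data and the helper name running_total are hypothetical, and this is not necessarily the design that was being proposed here.

```python
import numpy as np
import xarray as xr

values = np.arange(12.0).reshape(3, 4)

# A dask-backed DataArray with one chunk per time step (illustrative data).
arr = xr.DataArray(values, dims=("time", "x")).chunk({"time": 1})

def running_total(block):
    # Hypothetical user function: takes and returns an in-memory xarray object
    # and refers to the dimension by name rather than by axis number.
    return block.cumsum("x")

# The call is lazy; each chunk is handed to running_total as a DataArray.
lazy_result = xr.map_blocks(running_total, arr)
computed = lazy_result.compute()
```

The user function never sees dask and never needs axis numbers; dimension names survive the round trip.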

Issue Analytics

  • State:closed
  • Created 8 years ago
  • Comments:11 (9 by maintainers)

Top GitHub Comments

2 reactions
shoyer commented, Aug 24, 2017

I have a preliminary implementation up in https://github.com/pydata/xarray/pull/1517

1 reaction
shoyer commented, Sep 22, 2016

I think #964 provides a viable path forward here.

Previously, I was imagining that the user provides a function that maps xarray.DataArray -> xarray.DataArray. Such functions are tricky to parallelize with dask.array because we need to run them to figure out the result dimensions/coordinates.

In contrast, with a user defined function ndarray -> ndarray, it’s fairly straightforward to parallelize these with dask array (e.g., using dask.array.elemwise or dask.array.map_blocks). Then we could add the metadata back in afterwards with #964.
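A minimal sketch of that two-step idea, assuming an elementwise function so that shape and chunking are unchanged. The function name scale and the sample data are hypothetical, and the labels are re-attached by hand here rather than through #964.

```python
import numpy as np
import dask.array as da
import xarray as xr

def scale(block):
    # Hypothetical user function: ndarray -> ndarray, elementwise.
    return block * 2.0

values = np.arange(12.0).reshape(3, 4)
labeled = xr.DataArray(
    da.from_array(values, chunks=(1, 4)),
    dims=("time", "x"),
    coords={"time": [10, 20, 30]},
)

# Step 1: parallelize on the unlabeled dask array.
lazy = da.map_blocks(scale, labeled.data, dtype=labeled.dtype)

# Step 2: add the metadata back; valid only because the shape is unchanged.
relabeled = xr.DataArray(lazy, dims=labeled.dims, coords=labeled.coords)
```

For functions that add or remove dimensions, step 2 needs to know how the dimensions change, which is what a signature would describe.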

In principle, we could do this automatically – especially if dask had a way to parallelize arbitrary NumPy generalized universal functions. Then the user could write something like xarray.apply(func, data, signature=signature, dask_array='auto') to automatically parallelize func over their data. In fact, I had this in some previous commits for #964, but took it out for now, just to reduce scope for the change.
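For comparison, recent xarray releases expose xarray.apply_ufunc with a dask="parallelized" option. The sketch below assumes such a release with dask installed; it is not the exact xarray.apply(...) call written above. The helper name demean and the data are hypothetical.

```python
import numpy as np
import xarray as xr

def demean(block):
    # Hypothetical ndarray -> ndarray function. apply_ufunc moves the core
    # dimension ("x") to the last axis before calling it.
    return block - block.mean(axis=-1, keepdims=True)

values = np.arange(12.0).reshape(3, 4)

# Chunk only along "time" so the core dimension "x" stays in a single chunk.
arr = xr.DataArray(values, dims=("time", "x")).chunk({"time": 1})

anomalies = xr.apply_ufunc(
    demean,
    arr,
    input_core_dims=[["x"]],    # dimensions the function consumes
    output_core_dims=[["x"]],   # dimensions the function returns
    dask="parallelized",        # apply lazily, block by block
    output_dtypes=[arr.dtype],
)
computed = anomalies.compute()
```

The core-dimension lists play the role of the signature: they tell xarray how dimensions flow through the function, so it does not have to run the function to find out.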
