
Integration with dask/distributed (xarray backend design)


Dask (https://github.com/dask/dask) currently provides on-node parallelism for medium-size data problems. However, analyzing large climate data sets is a big-data problem that will require multiple-node parallelism. A likely solution is integration of distributed (https://github.com/dask/distributed) with dask. Distributed is now integrated with dask and its benefits are already starting to be realized; see, e.g., http://matthewrocklin.com/blog/work/2016/02/26/dask-distributed-part-3.

Thus, this issue is designed to identify, at a high level, the steps needed to perform this integration. As stated by @shoyer, it will

"definitely require some refactoring of the xarray backend system to make this work cleanly, but that’s OK – the xarray backend system is indicated as experimental/internal API precisely because we hadn’t figured out all the use cases yet."

To be honest, I’ve never been entirely happy with the design we took there (we use inheritance rather than composition for backend classes), but we did get it to work for our use cases. Some refactoring with an eye towards compatibility with dask distributed seems like a very worthwhile endeavor. We do have the benefit of a pretty large test suite covering existing use cases.

Thus, we have the chance to make xarray big-data capable as well as provide improvements to the backend.
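
The inheritance-versus-composition remark quoted above can be illustrated with a small sketch. This is not xarray's backend code: `ArrayReader`, `InMemoryReader`, and `CachingStore` are hypothetical names invented for illustration, and the example only shows the general idea that a store can hold a reader object rather than subclass it, so that caching, locking, or distributed execution could be layered on without changing the reader.

```python
from typing import Dict, List, Protocol


class ArrayReader(Protocol):
    """Hypothetical interface: anything that can return a named variable."""

    def read(self, name: str) -> List[float]: ...


class InMemoryReader:
    """Hypothetical reader backed by a dict, standing in for a file format."""

    def __init__(self, variables: Dict[str, List[float]]) -> None:
        self._variables = variables

    def read(self, name: str) -> List[float]:
        return list(self._variables[name])


class CachingStore:
    """Hypothetical store that holds a reader instead of subclassing one."""

    def __init__(self, reader: ArrayReader) -> None:
        self._reader = reader
        self._cache: Dict[str, List[float]] = {}
        self.reads = 0  # how many times the wrapped reader was actually used

    def get(self, name: str) -> List[float]:
        # Only fall through to the wrapped reader on a cache miss.
        if name not in self._cache:
            self._cache[name] = self._reader.read(name)
            self.reads += 1
        return self._cache[name]


store = CachingStore(InMemoryReader({"temperature": [280.0, 281.5]}))
first = store.get("temperature")
second = store.get("temperature")  # served from the cache
```

Because `CachingStore` depends only on the `read` method, any other reader with the same method could be swapped in without touching the caching logic.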

To this end, I’m starting this issue to help begin the design process following the xarray mailing list discussion some of us have been having (@shoyer, @mrocklin, @rabernat).

Task To Do List:

Issue Analytics

  • State: closed
  • Created 7 years ago
  • Comments: 59 (54 by maintainers)

Top GitHub Comments

1 reaction
jhamman commented, Jan 13, 2019

Closing this old issue. The final checkbox in @pwolfram’s original post was completed in #2261.

1 reaction
shoyer commented, Oct 29, 2016

"Distributed Dask.array could possibly replace OpenDAP in some settings though"

Yes, this sounds quite promising to me.

Using OpenDAP for communication is also possible, but if all we need to do is pass around serialized xarray.Dataset objects, using pickle or even bytes from netCDF files seems more promising.
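
One practical wrinkle with the pickle approach is that Python cannot pickle an open file object, so anything that wraps a file handle needs some care. The following is a minimal standard-library sketch, not xarray's implementation: `ReopeningStore` is a hypothetical class, the file holds arbitrary stand-in bytes rather than real netCDF data, and the sketch assumes the receiving process can see the same path (for example, on a shared filesystem).

```python
import os
import pickle
import tempfile


class ReopeningStore:
    """Hypothetical store that pickles its file path, not its open handle."""

    def __init__(self, path: str) -> None:
        self.path = path
        self._handle = None  # opened lazily on first read

    def _file(self):
        if self._handle is None:
            self._handle = open(self.path, "rb")
        return self._handle

    def read_bytes(self) -> bytes:
        f = self._file()
        f.seek(0)
        return f.read()

    def close(self) -> None:
        if self._handle is not None:
            self._handle.close()
            self._handle = None

    def __getstate__(self):
        # Open file objects cannot be pickled, so ship only the path.
        return {"path": self.path}

    def __setstate__(self, state):
        self.path = state["path"]
        self._handle = None  # the receiver reopens the file on first use


# Write a small stand-in file (arbitrary bytes, not real netCDF).
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as tmp:
    tmp.write(b"stand-in payload")
    path = tmp.name

store = ReopeningStore(path)
original = store.read_bytes()   # opens the handle locally
payload = pickle.dumps(store)   # works even though a handle is open
clone = pickle.loads(payload)   # e.g. on a worker sharing the filesystem
handle_before_read = clone._handle
roundtrip = clone.read_bytes()  # reopens the file from the pickled path

store.close()
clone.close()
os.remove(path)
```

The pickled payload here carries only the path, so it stays small regardless of file size; sending the file's raw bytes instead would avoid the shared-filesystem assumption at the cost of moving the data itself.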
