Parallel tasks on subsets of a dask array wrapped in an xarray Dataset

I have a large xarray.Dataset stored in zarr format. I want to perform some custom operations on it that cannot be expressed with the numpy-like functions that a Dask cluster parallelizes automatically. Therefore, I partition the dataset into small subsets and, for each subset, submit to my Dask cluster a task of the form

import xarray

def my_task(zarr_path, subset_index):
    ds = xarray.open_zarr(zarr_path)  # this returns an xarray.Dataset containing a dask.array
    sel = ds.sel(subset_index)  # select the subset assigned to this task
    sel = sel.load()  # I want to get the data into memory
    # then do my custom operations
    ...

However, I have noticed this creates a “task within a task”: when a worker receives “my_task”, it in turn submits tasks to the cluster to load the relevant part of the dataset. To avoid this and ensure that the full task is executed within the worker, I instead submit the following task:

import dask

def my_task_2(zarr_path, subset_index):
    # use the local threaded scheduler so the load runs inside this worker
    with dask.config.set(scheduler="threading"):
        return my_task(zarr_path, subset_index)

Is this the best way to do this? What’s the best practice for this kind of situation?

I have already posted this on Stack Overflow but did not get any answer, so I am adding it here in the hope of increasing visibility. Apologies if this is considered “pollution”. https://stackoverflow.com/questions/62874267/parallel-tasks-on-subsets-of-a-dask-array-wrapped-in-an-xarray-dataset

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

rabernat commented, Jul 22, 2020 (2 reactions)

“The reason is that my function here must be applied along the time dimension (e.g., a rolling median in time), but my data is chunked across the time dimension.”

This is a fundamental problem that is rather hard to solve without creating a copy of the data.

We just released the rechunker package, which makes it easy to create a copy of your data with a different chunking scheme (e.g., contiguous in time, chunked in space). If you have enough disk space to store a copy, this might be a good solution.

dcherian commented, Jul 22, 2020 (1 reaction)

You could try dask’s map_overlap to share “halo” or “ghost” points between chunks. Also see https://image.dask.org/en/latest/dask_image.ndfilters.html#dask_image.ndfilters.median_filter
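
To make the halo idea concrete, here is a minimal plain-Python sketch of a rolling median computed chunk by chunk. It is not dask's implementation and does not use dask's API: the function names, the chunk_size and half_window parameters, and the thread pool are all assumptions made for illustration. Each chunk is extended by half_window neighbouring points on each side, processed independently, and then trimmed back to its own points.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import median


def rolling_median(values, half_window):
    """Centered rolling median; the window is truncated at both ends of `values`."""
    n = len(values)
    return [
        median(values[max(0, i - half_window): min(n, i + half_window + 1)])
        for i in range(n)
    ]


def rolling_median_chunked(values, chunk_size, half_window):
    """Same result as rolling_median, computed per chunk with halo points."""
    n = len(values)

    def process(start):
        stop = min(start + chunk_size, n)
        # Extend the chunk by `half_window` halo points on each side,
        # clipped to the bounds of the data.
        lo = max(0, start - half_window)
        hi = min(n, stop + half_window)
        padded = rolling_median(values[lo:hi], half_window)
        # Trim the halo so only this chunk's own points are returned.
        offset = start - lo
        return padded[offset: offset + (stop - start)]

    # Chunks are independent once they carry their halo, so they can be
    # processed concurrently (threads here only stand in for cluster workers).
    with ThreadPoolExecutor() as pool:
        pieces = pool.map(process, range(0, n, chunk_size))
        return [v for piece in pieces for v in piece]
```

With dask arrays, the depth argument of map_overlap plays the role of half_window here, and dask handles the padding and trimming. This approach only pays off when the halo is small compared with the chunk size.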
