Parallel tasks on subsets of a dask array wrapped in an xarray Dataset
I have a large xarray.Dataset stored as a zarr store. I want to perform some custom operations on it that cannot be expressed with the NumPy-like functions that a Dask cluster handles automatically. Therefore, I partition the dataset into small subsets and, for each subset, submit to my Dask cluster a task of the form:
    import xarray

    def my_task(zarr_path, subset_index):
        # open_zarr returns an xarray.Dataset backed by lazy dask arrays
        ds = xarray.open_zarr(zarr_path)
        # subset_index is a dict of indexers identifying this partition
        sel = ds.sel(subset_index)
        sel = sel.load()  # I want to get the data into memory
        # then do my custom operations
        ...
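Roughly, the tasks are submitted like this (a simplified sketch; the client setup, zarr path, and subset_indices below are placeholders, not my actual code):

    from dask.distributed import Client

    client = Client()  # or Client("tcp://scheduler:8786") for an existing cluster

    # placeholder partitioning: one indexer dict per subset
    subset_indices = [{"time": slice(i, i + 100)} for i in range(0, 1000, 100)]

    futures = [client.submit(my_task, "data.zarr", idx) for idx in subset_indices]
    results = client.gather(futures)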
However, I have noticed this creates a “task within a task”: when a worker receives “my_task”, it in turn submits tasks to the cluster to load the relevant part of the dataset. To avoid this and ensure that the full task is executed within the worker, I am submitting instead the task:
    import dask

    def my_task_2(zarr_path, subset_index):
        # compute locally with the threaded scheduler instead of the cluster
        with dask.config.set(scheduler="threading"):
            return my_task(zarr_path, subset_index)
Is this the best way to do this? What’s the best practice for this kind of situation?
I have already posted this on Stack Overflow but did not get any answers, so I am adding it here in the hope of increasing visibility. Apologies if this is considered “pollution”. https://stackoverflow.com/questions/62874267/parallel-tasks-on-subsets-of-a-dask-array-wrapped-in-an-xarray-dataset
Issue Analytics
- Created: 3 years ago
- Comments: 5 (2 by maintainers)
This is a fundamental problem that is rather hard to solve without creating a copy of the data.
We just released the rechunker package, which makes it easy to create a copy of your data with a different chunking scheme (e.g. contiguous in time, chunked in space). If you have enough disk space to store a copy, this might be a good solution.
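As a rough sketch of what that could look like (the store paths, variable name, chunk sizes, and memory budget below are placeholders, not taken from this issue):

    import zarr
    from rechunker import rechunk

    source = zarr.open("input.zarr")["my_variable"]  # placeholder array in the zarr store

    plan = rechunk(
        source,
        target_chunks=(1000, 10, 10),   # e.g. contiguous in time, small in space
        max_mem="1GB",                  # memory budget per worker
        target_store="rechunked.zarr",
        temp_store="rechunk-tmp.zarr",
    )
    plan.execute()  # runs as a dask computation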
You could try dask’s map_overlap to share “halo” or ghost points between chunks. Also see https://image.dask.org/en/latest/dask_image.ndfilters.html#dask_image.ndfilters.median_filter
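For example (array sizes, chunking, and overlap depth are illustrative, and scipy’s median_filter stands in for whatever per-chunk operation you need):

    import dask.array as da
    from scipy import ndimage as ndi

    x = da.random.random((1000, 1000), chunks=(250, 250))  # stand-in for the real data

    # share a one-cell halo between neighbouring chunks so the filter sees
    # its full neighbourhood at chunk edges
    filtered = da.map_overlap(
        lambda block: ndi.median_filter(block, size=3),
        x,
        depth=1,
        boundary="reflect",
    )
    result = filtered.compute()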