Parallel tasks on subsets of a dask array wrapped in an xarray Dataset
I have a large xarray.Dataset stored as a zarr store. I want to perform some custom operations on it that cannot be expressed with the NumPy-like functions that a Dask cluster handles automatically. Therefore, I partition the dataset into small subsets and, for each subset, submit to my Dask cluster a task of the form:
    import xarray

    def my_task(zarr_path, subset_index):
        # open_zarr returns an xarray.Dataset backed by lazy dask arrays
        ds = xarray.open_zarr(zarr_path)
        # subset_index is a dict of indexers identifying this partition
        sel = ds.sel(subset_index)
        sel = sel.load()  # I want to get the data into memory
        # then do my custom operations
        ...
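Roughly, the tasks are submitted like this (a simplified sketch; the client setup, zarr path, and subset_indices below are placeholders, not my actual code):

    from dask.distributed import Client

    client = Client()  # or Client("tcp://scheduler:8786") for an existing cluster

    # placeholder partitioning: one indexer dict per subset
    subset_indices = [{"time": slice(i, i + 100)} for i in range(0, 1000, 100)]

    futures = [client.submit(my_task, "data.zarr", idx) for idx in subset_indices]
    results = client.gather(futures)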
However, I have noticed this creates a “task within a task”: when a worker receives “my_task”, it in turn submits tasks to the cluster to load the relevant part of the dataset. To avoid this and ensure that the full task is executed within the worker, I am submitting instead the task:
    import dask

    def my_task_2(zarr_path, subset_index):
        # compute locally with the threaded scheduler instead of the cluster
        with dask.config.set(scheduler="threading"):
            return my_task(zarr_path, subset_index)
Is this the best way to do this? What’s the best practice for this kind of situation?
I have already posted this on Stack Overflow but did not get any answers, so I am adding it here in the hope of increasing visibility. Apologies if this is considered “pollution”. https://stackoverflow.com/questions/62874267/parallel-tasks-on-subsets-of-a-dask-array-wrapped-in-an-xarray-dataset
Issue Analytics
- Created: 3 years ago
- Comments: 5 (2 by maintainers)
This is a fundamental problem that is rather hard to solve without creating a copy of the data.
We just released the rechunker package, which makes it easy to create a copy of your data with a different chunking scheme (e.g. contiguous in time, chunked in space). If you have enough disk space to store a copy, this might be a good solution.
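As a rough sketch of what that could look like (the store paths, variable name, chunk sizes, and memory budget below are placeholders, not taken from this issue):

    import zarr
    from rechunker import rechunk

    source = zarr.open("input.zarr")["my_variable"]  # placeholder array in the zarr store

    plan = rechunk(
        source,
        target_chunks=(1000, 10, 10),   # e.g. contiguous in time, small in space
        max_mem="1GB",                  # memory budget per worker
        target_store="rechunked.zarr",
        temp_store="rechunk-tmp.zarr",
    )
    plan.execute()  # runs as a dask computation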
You could try dask’s map_overlap to share “halo” or ghost points between chunks. Also see https://image.dask.org/en/latest/dask_image.ndfilters.html#dask_image.ndfilters.median_filter
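For example (array sizes, chunking, and overlap depth are illustrative, and scipy’s median_filter stands in for whatever per-chunk operation you need):

    import dask.array as da
    from scipy import ndimage as ndi

    x = da.random.random((1000, 1000), chunks=(250, 250))  # stand-in for the real data

    # share a one-cell halo between neighbouring chunks so the filter sees
    # its full neighbourhood at chunk edges
    filtered = da.map_overlap(
        lambda block: ndi.median_filter(block, size=3),
        x,
        depth=1,
        boundary="reflect",
    )
    result = filtered.compute()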