
Very large Dask graph size with high resolution grids


Regridding from a 0.1 degree to a 0.25 degree global grid creates a very large task graph with xesmf 0.6.1, causing problems when evaluating the result.

Sample code:

import xarray
import xesmf
import dask.array

# Source (0.1 degree) and destination (0.25 degree) global grids
a = xesmf.util.grid_global(0.1, 0.1)
b = xesmf.util.grid_global(0.25, 0.25)
r = xesmf.Regridder(a, b, method='bilinear')

# 720 timesteps of random data, chunked along time only (12 steps per chunk)
data = dask.array.random.random((720, a.dims['y'], a.dims['x']), chunks=(12, -1, -1))
da = xarray.DataArray(data, dims=['time', 'y', 'x'], coords={'y': a.y, 'x': a.x})

# Regrid lazily and look at the resulting dask array
r(da).data

Evaluated versions at https://nbviewer.org/gist/ScottWales/4fe0e9a5725b5a3bf07ab94e80778846

With 0.6.0, the regridded data has 240 Dask tasks. With 0.6.1 the number has exploded to 409,841 tasks and appears to scale with grid resolution: low-resolution regrids produce task graphs of a reasonable size.
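For reference (a minimal sketch, not part of the original report): the sample data has 60 time chunks (720 timesteps at 12 per chunk), so 0.6.0's 240 tasks works out to about four tasks per chunk. One way to count the tasks, assuming the variables from the sample code above:

result = r(da).data
print(len(result.__dask_graph__()))  # total number of tasks in the graph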

The expected behaviour is that the Dask task graph created by xesmf is a reasonably small multiple of the number of chunks in the input data, independent of grid resolution. An overly large task graph slows down Dask as it prepares the tasks to run and causes excess memory use when the result is evaluated.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 9 (6 by maintainers)

Top GitHub Comments

1 reaction
raphaeldussin commented, Nov 10, 2021

@dcherian thanks very much for this valuable feedback. @aulemahal, do you have time to look into this?

1 reaction
aulemahal commented, Nov 4, 2021

Argh… Sorry guys!

Ok so I did see that problem in the original PR about sparse, but I guess I forgot and we didn't discuss it. @huard spotted the issue. Because of the tremendous size of the weight array (n_in × n_out), dask tries to rechunk it on the fly to limit the chunk size; it doesn't see that the array is sparse.
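To put a number on that (rough arithmetic, not from the original comment): for the 0.1 to 0.25 degree global regrid above, the weight matrix has n_in = 3600 × 1800 source points and n_out = 1440 × 720 destination points, which dask sizes as if it were dense float64:

n_in = 3600 * 1800   # 0.1 degree source grid
n_out = 1440 * 720   # 0.25 degree destination grid
print(n_in * n_out * 8 / 2**40)  # ~48.9 TiB apparent size, so auto-rechunking kicks in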

@ScottWales: the workaround is to configure dask to accept large chunks. Since the chunks are sparse, there is no memory explosion. Ex:

from dask import config
config.set({'array.chunk-size': '500GiB'})
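If you would rather not change the setting globally, dask.config.set also works as a context manager, so the override can be scoped to just the regrid step (a sketch using the variables from the sample code above):

import dask

with dask.config.set({'array.chunk-size': '500GiB'}):
    out = r(da)  # graph construction is what triggers the rechunk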

I am currently digging through dask's code and docs. I found the following:

The weights are transformed to a dask array in dask.array.apply_gufunc, which is called by xr.apply_ufunc when any input uses the dask backend. Precisely, in dask.array.core.asarray.

There, dask doesn't recognize the sparse array as anything special and converts it to dask the same way it would a numpy array. Thus, the size is computed as np.prod(shape) and the array is rechunked into smaller pieces. This is the problematic and useless part.
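The mismatch is easy to demonstrate in isolation (a small stand-in example, not from the comment; the shape and density are arbitrary):

import numpy as np
import sparse

w = sparse.random((20000, 30000), density=1e-6)
print(np.prod(w.shape) * w.dtype.itemsize)  # apparent dense size dask uses (~4.8 GB)
print(w.nbytes)                             # actual memory held by the sparse data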

(It didn't happen before because we were passing the weights as a kwarg of apply_ufunc instead of as a normal input. That had the ill effect of copying the weights as many times as there were regrid_array tasks, which is why we added the sparse backend.)

The solution is to have dask measure the size of the array from the number of non-zero elements, not from its shape. I see a few ways forward:

  1. Implement something in dask to recognize “sparse” arrays (sparse but also scipy) and judge the appropriate chunking accordingly.
  2. Implement to_dask_array() on sparse.SparseArray to handle this (dask’s asarray will detect it).
  3. Handle this in xarray (in xr.core.computation.apply_variable_ufunc)
  4. Handle this ourselves, simply by detecting that the regridding input uses dask and then creating the weights array ourselves, using a single chunk (sketched below).

I personally believe (2) is the cleanest way, but we'd have to ask the people at sparse. (4) would solve our problem quickly, as it is easy to implement.
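A minimal sketch of what option (4) could look like (hypothetical code, not the actual xesmf patch; weights_coo stands in for the regridder's sparse weight matrix):

import dask.array as dsa
import sparse

def weights_as_single_chunk(weights_coo: sparse.COO) -> dsa.Array:
    # Wrap the sparse weights in a one-chunk dask array before apply_ufunc
    # sees them, so dask's auto-rechunking never inspects the apparent
    # dense size computed from np.prod(shape).
    return dsa.from_array(weights_coo, chunks=weights_coo.shape)

Because the array is sparse, a single chunk only costs the memory of the non-zero entries.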
