Very large Dask graph size with high resolution grids
Regridding from 0.1 to 0.25 degree global grids creates a very large task graph with xesmf 0.6.1, causing problems when evaluating the result.
Sample code:
import xarray
import xesmf
import dask.array

# Source grid at 0.1 degree and destination grid at 0.25 degree resolution
a = xesmf.util.grid_global(0.1, 0.1)
b = xesmf.util.grid_global(0.25, 0.25)
r = xesmf.Regridder(a, b, method='bilinear')

# 720 time steps of random data on the source grid, chunked 12 steps at a time
data = dask.array.random.random((720, a.dims['y'], a.dims['x']), chunks=(12, -1, -1))
da = xarray.DataArray(data, dims=['time', 'y', 'x'], coords={'y': a.y, 'x': a.x})

# Regrid lazily and inspect the resulting dask array
r(da).data
Evaluated results under both versions are shown at https://nbviewer.org/gist/ScottWales/4fe0e9a5725b5a3bf07ab94e80778846
With 0.6.0, the regridded data has 240 Dask tasks. With 0.6.1 the number has exploded to 409,841 tasks, and it appears to scale with the grid resolution: low-resolution regrids produce task graphs of a reasonable size.
The expected behaviour is that the Dask task graph created by xesmf is a reasonably small multiple of the number of chunks in the input data, independent of grid resolution. An overly large task graph slows Dask down while it prepares the tasks to run, and it causes excess memory use when the result is evaluated.
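For reference, one way to check the task counts quoted above (a sketch; the linked notebook may measure them differently):
# Count the tasks in the graph of the lazily regridded result
out = r(da)
print(len(out.data.__dask_graph__()))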
Top GitHub Comments
@dcherian thanks very much for this valuable feedback. @aulemahal, do you have time to look into this?
Argh… Sorry guys!
Ok, so I did see that problem in the original PR about sparse, but I guess I forgot and we didn't discuss it. @huard spotted the issue. Because of the tremendous size of the weight array (n_in x n_out), dask tries to rechunk it on the fly to limit the chunk size. It doesn't see that the arrays are sparse. @ScottWales: the workaround is to configure dask to accept large chunks. Since the chunks are sparse, there is no memory explosion. Ex:
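A sketch of what such a configuration could look like, assuming dask's array.chunk-size option is the relevant setting (the value below is illustrative and depends on the grid sizes):
import dask

# Raise dask's target chunk size so the sparse weight matrix is not split
# into many small chunks when it is converted to a dask array; the chunks
# stay sparse, so memory use stays low.
with dask.config.set({"array.chunk-size": "64GiB"}):
    out = r(da)  # the graph is built here, so the setting must be active here
result = out.compute()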
I am currently digging into dask's code and docs. I found the following:
The weights are transformed to a dask array in dask.array.apply_gufunc, which is called by xr.apply_ufunc when any input uses the dask backend; precisely, this happens in dask.array.core.asarray. There, dask doesn't recognize the sparse array as anything special and converts it to dask the same way as if it were a numpy array. THUS, the size is computed as np.prod(shape) and the array is rechunked into smaller pieces. This is the problematic and useless part (see the sketch at the end of this comment).
(It didn't happen before because we were passing the weights as a kwarg of apply_ufunc, instead of as a normal input. That had the ill effect of copying the weights as many times as there were regrid_array tasks, which is why we added the sparse backend.)
The solution is to have dask measure the size of the array from the number of non-zero elements, not from its shape. I see a few ways forward:
1. Have dask detect sparse arrays (not only sparse, but also scipy) and judge the appropriate chunking accordingly.
2. Implement to_dask_array() on sparse.SparseArray to handle this (dask's asarray will detect it).
3. Have xarray handle sparse inputs specially (in xr.core.computation.apply_variable_ufunc).
4. Convert the weights to a dask array ourselves in xESMF, before they are passed to apply_ufunc.
I personally believe (2) is the cleanest way, but we have to ask the people at sparse. (4) would solve our problem quickly as it is easy to implement.
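To make the rechunking behaviour described above concrete, here is a small sketch (not from the original thread) that asks dask how it would auto-chunk a dense array with the weight matrix's shape; the grid sizes are inferred from the 0.25 and 0.1 degree global grids in the report:
import numpy as np
import dask.array as da

# Weight matrix shape: (number of destination points, number of source points)
n_out = 720 * 1440    # 0.25 degree global grid
n_in = 1800 * 3600    # 0.1 degree global grid

# "auto" chunking only considers shape and dtype (roughly np.prod(shape) * itemsize),
# so the sparse weight matrix is treated as tens of terabytes of dense float64
# and split into a very large number of chunks.
chunks = da.core.normalize_chunks("auto", shape=(n_out, n_in), dtype=np.float64)
print(int(np.prod([len(c) for c in chunks])))  # hundreds of thousands of chunks at the default 128 MiB target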