Unexpected chunking behavior when using `xr.align` with `join='outer'`
I just came across some unexpected behavior when using xr.align
with the option join='outer'
on two DataArrays that contain dask arrays and have different dimension lengths.
MCVE Code Sample
import numpy as np
import xarray as xr

# Two time axes of different lengths (12 vs. 120 time steps)
short_time = xr.cftime_range('2000', periods=12)
long_time = xr.cftime_range('2000', periods=120)

data_short = np.random.rand(len(short_time))
data_long = np.random.rand(len(long_time))

# Both arrays are chunked along 'time' with a chunk size of 3
a = xr.DataArray(data_short, dims=['time'], coords={'time': short_time}).chunk({'time': 3})
b = xr.DataArray(data_long, dims=['time'], coords={'time': long_time}).chunk({'time': 3})

a, b = xr.align(a, b, join='outer')
Expected Output
As expected, a is filled with missing values:
a.plot()
b.plot()
But the filled values do not replicate the chunking along the time dimension in b. Instead the padded values are in one single chunk, which can be substantially larger than the others.
a.data
b.data
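For anyone reading this without the HTML repr, the chunk sizes can also be checked directly. A minimal sketch (the expected values in the comments reflect my reading of the repr above, where the padded part of a ends up in one large chunk):

# b keeps 40 uniform chunks of size 3 along 'time'
print(b.chunks)   # ((3, 3, ..., 3),)
# a keeps its original four chunks of size 3, while the 108 padded values
# appear to land in one single chunk
print(a.chunks)   # something like ((3, 3, 3, 3, 108),)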
(Quick shoutout for the amazing html representation. This made diagnosing this problem super easy! 🥳 )
Problem Description
I think for many problems it would be more appropriate if the padded portion of the array had the same chunking scheme as the longer array.
A practical example (which brought me to this issue) is given in the CMIP6 data archive, where some models give output for several members, with some of them running longer than others, leading to problems when these are combined (see intake-esm/#225).
Basically, for that particular model there are 5 members with a runtime of 100 years and one member with a runtime of 300 years. I think using xr.align immediately leads to a chunk that is 200 years long, which blows up the memory on all systems I have tried this on.
Is there a way to work around this, or is this behavior intended and I am missing something?
cc’ing @dcherian @andersy005
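One possible workaround (just a sketch, not necessarily the intended fix) would be to rechunk the padded array explicitly after aligning, so the fill region is split into the same chunk size as the rest:

a, b = xr.align(a, b, join='outer')
# force uniform chunks along 'time' again; avoids the single oversized chunk
a = a.chunk({'time': 3})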
Versions
Output of xr.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.8.2 | packaged by conda-forge | (default, Apr 24 2020, 08:20:52) [GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-1127.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.5
libnetcdf: 4.7.4

xarray: 0.15.1
pandas: 1.0.3
numpy: 1.18.4
scipy: 1.4.1
netCDF4: 1.5.3
pydap: None
h5netcdf: 0.8.0
h5py: 2.10.0
Nio: None
zarr: 2.4.0
cftime: 1.1.2
nc_time_axis: 1.2.0
PseudoNetCDF: None
rasterio: 1.1.3
cfgrib: None
iris: None
bottleneck: None
dask: 2.15.0
distributed: 2.15.2
matplotlib: 3.2.1
cartopy: 0.18.0
seaborn: None
numbagg: None
setuptools: 46.1.3.post20200325
pip: 20.1
conda: None
pytest: 5.4.2
IPython: 7.14.0
sphinx: None
Comments (6, all by maintainers)
Rechunking the indexer array is how I would be explicit about the desired chunk size. Opened https://github.com/dask/dask/issues/6270 to discuss this on the dask side.

Great diagnosis @jbusecke. Ultimately this comes down to dask indexing.

cc @TomAugspurger, the issue here is that the big size-112 chunk takes down the cluster in https://github.com/NCAR/intake-esm/issues/225
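A minimal dask-level sketch of the "rechunk the indexer" idea above (this is not xarray's internal code path; it assumes dask's support for indexing with an integer dask array, where, as I understand it, the output chunks along the indexed axis follow the indexer's chunks):

import dask.array as da
import numpy as np

x = da.random.random(12, chunks=3)

# a longer indexer with explicit chunks of size 3; indexing with it
# should propagate those chunks to the result along the indexed axis
idx = da.from_array(np.arange(120) % 12, chunks=3)

y = x[idx]
print(y.chunks)  # expecting 40 chunks of size 3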