
Unexpected chunking behavior when using `xr.align` with `join='outer'`


I just came across some unexpected behavior when using `xr.align` with the option `join='outer'` on two DataArrays that contain dask arrays and have different dimension lengths.

MCVE Code Sample

import numpy as np
import xarray as xr

short_time = xr.cftime_range('2000', periods=12)
long_time = xr.cftime_range('2000', periods=120)

data_short = np.random.rand(len(short_time))
data_long = np.random.rand(len(long_time))
a = xr.DataArray(data_short, dims=['time'], coords={'time': short_time}).chunk({'time': 3})
b = xr.DataArray(data_long, dims=['time'], coords={'time': long_time}).chunk({'time': 3})

a, b = xr.align(a, b, join='outer')

Expected Output

As expected, `a` is filled with missing values:

a.plot()
b.plot()

[figure: line plots of `a` and `b`; `a` is padded with missing values over the extended time range]

But the filled values do not replicate the chunking along the time dimension of `b`. Instead, the padded values end up in one single chunk, which can be substantially larger than the others.

a.data

[dask array repr of `a.data`: the padded values sit in one large trailing chunk]

b.data

[dask array repr of `b.data`: uniform chunks of size 3 along time]

(Quick shoutout for the amazing html representation. This made diagnosing this problem super easy! 🥳 )
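For anyone reading this without the HTML repr, the same diagnosis can be made by printing the chunk tuples directly. This is a minimal sketch against the MCVE above; the comments describe the shape of the result rather than exact values, since the precise grouping depends on how dask indexes across chunk boundaries:

print(a.chunks)  # the 108 padded values collapse into one large trailing chunk
print(b.chunks)  # ((3, 3, ..., 3),) -- 40 uniform chunks of 3 along time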

Problem Description

I think for many problems it would be more appropriate if the padded portion of the array followed the chunking scheme of the longer array.

A practical example (which brought me to this issue) comes from the CMIP6 data archive, where some models provide output for several members, some of which run longer than others, leading to problems when these are combined (see intake-esm/#225). For that particular model there are 5 members with a runtime of 100 years and one member with a runtime of 300 years. I think using `xr.align` immediately produces a chunk that is 200 years long and blows up the memory on every system I have tried this on.
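To make that concrete at a smaller scale, here is a purely illustrative sketch of the member-length mismatch; the variable names, monthly resolution, and yearly chunking are my assumptions, not the actual CMIP6 data:

import numpy as np
import xarray as xr

# two hypothetical members: one runs 100 years, one runs 300 years (monthly output)
t_short = xr.cftime_range('2000', periods=100 * 12, freq='MS')
t_long = xr.cftime_range('2000', periods=300 * 12, freq='MS')

member_a = xr.DataArray(np.random.rand(len(t_short)), dims=['time'],
                        coords={'time': t_short}).chunk({'time': 12})
member_b = xr.DataArray(np.random.rand(len(t_long)), dims=['time'],
                        coords={'time': t_long}).chunk({'time': 12})

member_a, member_b = xr.align(member_a, member_b, join='outer')
# the 200 years of padding on member_a end up in one huge chunk
# instead of further 12-month chunks, which is what blows up memory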

Is there a way to work around this, or is this behavior intended and I am missing something?

cc’ing @dcherian @andersy005

Versions

Output of xr.show_versions():

INSTALLED VERSIONS
------------------
commit: None
python: 3.8.2 | packaged by conda-forge | (default, Apr 24 2020, 08:20:52) [GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-1127.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.5
libnetcdf: 4.7.4

xarray: 0.15.1
pandas: 1.0.3
numpy: 1.18.4
scipy: 1.4.1
netCDF4: 1.5.3
pydap: None
h5netcdf: 0.8.0
h5py: 2.10.0
Nio: None
zarr: 2.4.0
cftime: 1.1.2
nc_time_axis: 1.2.0
PseudoNetCDF: None
rasterio: 1.1.3
cfgrib: None
iris: None
bottleneck: None
dask: 2.15.0
distributed: 2.15.2
matplotlib: 3.2.1
cartopy: 0.18.0
seaborn: None
numbagg: None
setuptools: 46.1.3.post20200325
pip: 20.1
conda: None
pytest: 5.4.2
IPython: 7.14.0
sphinx: None

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

2 reactions
TomAugspurger commented, Jun 1, 2020

Rechunking the indexer array is how I would be explicit about the desired chunk size. Opened https://github.com/dask/dask/issues/6270 to discuss this on the dask side.
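From the xarray user side, a blunt workaround until this is handled upstream is to simply rechunk the aligned result. This is my own sketch of that idea, not something spelled out in the thread, and it assumes the MCVE arrays above:

# after aligning, force the padded region back to the original chunk size
a, b = xr.align(a, b, join='outer')
a = a.chunk({'time': 3})  # splits the single large padded chunk into chunks of 3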

1 reaction
dcherian commented, May 30, 2020

Great diagnosis @jbusecke.

Ultimately this comes down to dask indexing:

import dask.array

arr = dask.array.from_array([0, 1, 2, 3], chunks=(1,))
print(arr.chunks)  # ((1, 1, 1, 1),)

# align calls reindex, which indexes with something like this
indexer = [0, 1, 2, 3] + [-1] * 111
print(arr[indexer].chunks)  # ((1, 1, 1, 112),)

# maybe something like this is a solution
lazy_indexer = dask.array.from_array(indexer, chunks=arr.chunks[0][0], name="idx")
print(arr[lazy_indexer].chunks)  # ((1, 1, ..., 1),) -- 115 chunks of size 1

cc @TomAugspurger, the issue here is that the big 112-element chunk takes down the cluster in https://github.com/NCAR/intake-esm/issues/225
