DataArray.sel extremely slow
Problem description
.sel is an xarray method I use a lot and I would have expected it to be fairly efficient. However, even on tiny DataArrays, it takes seconds.
Code Sample, a copy-pastable example if possible
import timeit
setup = """
import itertools
import numpy as np
import xarray as xr
import string
a = list(string.printable)
b = list(string.ascii_lowercase)
d = xr.DataArray(np.random.rand(len(a), len(b)), coords={'a': a, 'b': b}, dims=['a', 'b'])
d.load()
"""
run = """
for _a, _b in itertools.product(a, b):
d.sel(a=_a, b=_b)
"""
running_times = timeit.repeat(run, setup, repeat=3, number=10)
print("xarray", running_times) # e.g. [14.792144000064582, 15.19372400001157, 15.345327000017278]
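As a side note (not part of the original benchmark): most of the time here is fixed per-call overhead in .sel, so selecting all of the points in a single call amortizes it. A minimal sketch, assuming xarray's vectorized indexing with DataArray indexers that share a common dimension:

import itertools
import numpy as np
import xarray as xr
import string

a = list(string.printable)
b = list(string.ascii_lowercase)
d = xr.DataArray(np.random.rand(len(a), len(b)), coords={'a': a, 'b': b}, dims=['a', 'b'])

# Build all (a, b) label pairs once, then select them in a single vectorized call.
points_a, points_b = zip(*itertools.product(a, b))
selected = d.sel(a=xr.DataArray(list(points_a), dims='points'),
                 b=xr.DataArray(list(points_b), dims='points'))
# `selected` is a 1-D DataArray with one value per (a, b) pair, in the same order.

This pays the .sel overhead once instead of len(a) * len(b) times, though it does not help when the lookups genuinely have to happen one at a time.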
Expected Output
I would have expected the above code to run in milliseconds.
However, it takes over 10 seconds!
Adding an additional d = d.stack(aa=['a'], bb=['b']) makes it even slower, roughly doubling the running time.
For reference, a naive dict-indexing implementation in plain Python takes about 0.01 seconds:
setup = """
import itertools
import numpy as np
import string
a = list(string.printable)
b = list(string.ascii_lowercase)
d = np.random.rand(len(a), len(b))
indexers = {'a': {coord: index for (index, coord) in enumerate(a)},
'b': {coord: index for (index, coord) in enumerate(b)}}
"""
run = """
for _a, _b in itertools.product(a, b):
index_a, index_b = indexers['a'][_a], indexers['b'][_b]
item = d[index_a][index_b]
"""
running_times = timeit.repeat(run, setup, repeat=3, number=10)
print("dicts", running_times) # e.g. [0.015355999930761755, 0.01466800004709512, 0.014295000000856817]
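The same trick can be reproduced against the DataArray itself: resolve each label to an integer position once through the underlying pandas indexes, then index the raw NumPy array, bypassing .sel entirely. A rough sketch (not from the original report), reusing the DataArray d from the first benchmark; DataArray.get_index returns the pandas Index for a dimension and Index.get_loc does the label-to-position lookup:

index_a = d.get_index('a')   # pandas.Index over coordinate 'a'
index_b = d.get_index('b')   # pandas.Index over coordinate 'b'
values = d.values            # plain NumPy array backing the DataArray

for _a, _b in itertools.product(a, b):
    # Label -> integer position via pandas, then plain positional indexing.
    item = values[index_a.get_loc(_a), index_b.get_loc(_b)]

This runs close to the dict baseline above because it skips the per-call object construction and coordinate handling that .sel performs.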
Output of xr.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-17134-Microsoft
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: en_US.UTF-8
xarray: 0.10.8
pandas: 0.23.4
numpy: 1.15.1
scipy: 1.1.0
netCDF4: 1.4.1
h5netcdf: None
h5py: None
Nio: None
zarr: None
bottleneck: None
cyordereddict: None
dask: None
distributed: None
matplotlib: 2.2.3
cartopy: None
seaborn: None
setuptools: 40.2.0
pip: 10.0.1
conda: None
pytest: 3.7.4
IPython: 6.5.0
sphinx: None
This is a follow-up from #2438.
I posted a manual solution to the multi-dimensional grouping in the stackoverflow thread. Hopefully .sel can be made more efficient though; it's such an everyday method.

Thanks @mschrimpf. Hopefully we can get multi-dimensional groupbys, too.