DataArray.sel extremely slow

Problem description

.sel is an xarray method I use a lot, and I would have expected it to be fairly efficient. However, even on tiny DataArrays, a loop of label-based lookups takes seconds.

Code Sample, a copy-pastable example if possible

import timeit

setup = """
import itertools
import numpy as np
import xarray as xr
import string

a = list(string.printable)
b = list(string.ascii_lowercase)
d = xr.DataArray(np.random.rand(len(a), len(b)), coords={'a': a, 'b': b}, dims=['a', 'b'])
d.load()
"""

run = """
for _a, _b in itertools.product(a, b):
    d.sel(a=_a, b=_b)
"""
running_times = timeit.repeat(run, setup, repeat=3, number=10)
print("xarray", running_times)  # e.g. [14.792144000064582, 15.19372400001157, 15.345327000017278]

Expected Output

I would have expected the above code to run in milliseconds. However, it takes over 10 seconds! Adding an additional d = d.stack(aa=['a'], bb=['b']) makes it even slower, about twice as slow.
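If all label pairs are known up front, one possible way to avoid paying the per-call cost thousands of times is a single vectorized, pointwise selection. The following is a minimal sketch, assuming an xarray version that supports DataArray indexers in .sel; the dimension name `points` and the variable names are illustrative, not from the original report, and no timing is claimed here.

```python
import itertools
import string

import numpy as np
import xarray as xr

a = list(string.printable)
b = list(string.ascii_lowercase)
d = xr.DataArray(np.random.rand(len(a), len(b)), coords={'a': a, 'b': b}, dims=['a', 'b'])

# Collect every (a, b) label pair we want to look up.
pairs = list(itertools.product(a, b))

# Wrap the labels in DataArrays that share a new dimension ("points" is an
# arbitrary, hypothetical name). Indexers sharing a dimension are matched
# element by element instead of forming an outer product.
labels_a = xr.DataArray([pair[0] for pair in pairs], dims='points')
labels_b = xr.DataArray([pair[1] for pair in pairs], dims='points')

# One .sel call replaces the whole Python loop.
selected = d.sel(a=labels_a, b=labels_b)
```

The result is a one-dimensional DataArray along `points`, with one value per pair in the same order as `pairs`.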

For reference, a naive dict-indexing implementation in Python takes 0.01 seconds:

setup = """
import itertools
import numpy as np
import string

a = list(string.printable)
b = list(string.ascii_lowercase)

d = np.random.rand(len(a), len(b))
indexers = {'a': {coord: index for (index, coord) in enumerate(a)},
            'b': {coord: index for (index, coord) in enumerate(b)}}
"""

run = """
for _a, _b in itertools.product(a, b):
    index_a, index_b = indexers['a'][_a], indexers['b'][_b]
    item = d[index_a][index_b]
"""
running_times = timeit.repeat(run, setup, repeat=3, number=10)
print("dicts", running_times)  # e.g. [0.015355999930761755, 0.01466800004709512, 0.014295000000856817]
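When the lookups really do have to happen one at a time, a middle ground between the two snippets above is to keep the DataArray for its labels but resolve them to positions yourself. This is only a sketch of that idea, not a fix for .sel itself: it assumes the coordinate labels are unique (so that pandas' `Index.get_loc` returns a single integer), and the names `index_a`, `index_b`, `values` and `results` are illustrative.

```python
import itertools
import string

import numpy as np
import xarray as xr

a = list(string.printable)
b = list(string.ascii_lowercase)
d = xr.DataArray(np.random.rand(len(a), len(b)), coords={'a': a, 'b': b}, dims=['a', 'b'])

# Fetch the pandas indexes and the underlying NumPy array once, outside the loop.
index_a = d.indexes['a']
index_b = d.indexes['b']
values = d.values

results = {}
for _a, _b in itertools.product(a, b):
    # get_loc maps a label to its integer position (labels assumed unique).
    results[(_a, _b)] = values[index_a.get_loc(_a), index_b.get_loc(_b)]
```

Each lookup then returns a plain NumPy scalar rather than a zero-dimensional DataArray, so any coordinate metadata has to be tracked separately.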

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-17134-Microsoft
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: en_US.UTF-8
xarray: 0.10.8
pandas: 0.23.4
numpy: 1.15.1
scipy: 1.1.0
netCDF4: 1.4.1
h5netcdf: None
h5py: None
Nio: None
zarr: None
bottleneck: None
cyordereddict: None
dask: None
distributed: None
matplotlib: 2.2.3
cartopy: None
seaborn: None
setuptools: 40.2.0
pip: 10.0.1
conda: None
pytest: 3.7.4
IPython: 6.5.0
sphinx: None

This is a follow-up to #2438.

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments:5

Top GitHub Comments

2 reactions
mschrimpf commented, Oct 2, 2018

I posted a manual solution to the multi-dimensional grouping in the Stack Overflow thread. Hopefully .sel can be made more efficient, though; it’s such an everyday method.

1 reaction
max-sixty commented, Oct 2, 2018

Thanks @mschrimpf. Hopefully we can get multi-dimensional groupbys, too.
