DataArray.sel extremely slow
Problem description
.sel is an xarray method I use a lot and I would have expected it to be fairly efficient. However, even on tiny DataArrays, it takes seconds.
Code Sample, a copy-pastable example if possible
import timeit
setup = """
import itertools
import numpy as np
import xarray as xr
import string
a = list(string.printable)
b = list(string.ascii_lowercase)
d = xr.DataArray(np.random.rand(len(a), len(b)), coords={'a': a, 'b': b}, dims=['a', 'b'])
d.load()
"""
run = """
for _a, _b in itertools.product(a, b):
d.sel(a=_a, b=_b)
"""
running_times = timeit.repeat(run, setup, repeat=3, number=10)
print("xarray", running_times) # e.g. [14.792144000064582, 15.19372400001157, 15.345327000017278]
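As a side note (not part of the original benchmark): most of the time here is fixed per-call overhead in .sel, so selecting all of the points in a single call amortizes it. A minimal sketch, assuming xarray's vectorized indexing with DataArray indexers that share a common dimension:

import itertools
import numpy as np
import xarray as xr
import string

a = list(string.printable)
b = list(string.ascii_lowercase)
d = xr.DataArray(np.random.rand(len(a), len(b)), coords={'a': a, 'b': b}, dims=['a', 'b'])

# Build all (a, b) label pairs once, then select them in a single vectorized call.
points_a, points_b = zip(*itertools.product(a, b))
selected = d.sel(a=xr.DataArray(list(points_a), dims='points'),
                 b=xr.DataArray(list(points_b), dims='points'))
# `selected` is a 1-D DataArray with one value per (a, b) pair, in the same order.

This pays the .sel overhead once instead of len(a) * len(b) times, though it does not help when the lookups genuinely have to happen one at a time.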
Expected Output
I would have expected the above code to run in milliseconds.
However, it takes over 10 seconds!
Adding an additional d = d.stack(aa=['a'], bb=['b']) makes it even slower, roughly doubling the running time.
For reference, a naive dict-indexing implementation in plain Python takes about 0.01 seconds:
setup = """
import itertools
import numpy as np
import string
a = list(string.printable)
b = list(string.ascii_lowercase)
d = np.random.rand(len(a), len(b))
indexers = {'a': {coord: index for (index, coord) in enumerate(a)},
'b': {coord: index for (index, coord) in enumerate(b)}}
"""
run = """
for _a, _b in itertools.product(a, b):
index_a, index_b = indexers['a'][_a], indexers['b'][_b]
item = d[index_a][index_b]
"""
running_times = timeit.repeat(run, setup, repeat=3, number=10)
print("dicts", running_times) # e.g. [0.015355999930761755, 0.01466800004709512, 0.014295000000856817]
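The same trick can be reproduced against the DataArray itself: resolve each label to an integer position once through the underlying pandas indexes, then index the raw NumPy array, bypassing .sel entirely. A rough sketch (not from the original report), reusing the DataArray d from the first benchmark; DataArray.get_index returns the pandas Index for a dimension and Index.get_loc does the label-to-position lookup:

index_a = d.get_index('a')   # pandas.Index over coordinate 'a'
index_b = d.get_index('b')   # pandas.Index over coordinate 'b'
values = d.values            # plain NumPy array backing the DataArray

for _a, _b in itertools.product(a, b):
    # Label -> integer position via pandas, then plain positional indexing.
    item = values[index_a.get_loc(_a), index_b.get_loc(_b)]

This runs close to the dict baseline above because it skips the per-call object construction and coordinate handling that .sel performs.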
Output of xr.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-17134-Microsoft
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: en_US.UTF-8
xarray: 0.10.8
pandas: 0.23.4
numpy: 1.15.1
scipy: 1.1.0
netCDF4: 1.4.1
h5netcdf: None
h5py: None
Nio: None
zarr: None
bottleneck: None
cyordereddict: None
dask: None
distributed: None
matplotlib: 2.2.3
cartopy: None
seaborn: None
setuptools: 40.2.0
pip: 10.0.1
conda: None
pytest: 3.7.4
IPython: 6.5.0
sphinx: None
This is a follow-up from #2438.
I posted a manual solution to the multi-dimensional grouping in the stackoverflow thread. Hopefully .sel can be made more efficient though; it's such an everyday method.

Thanks @mschrimpf. Hopefully we can get multi-dimensional groupbys, too.