groupby very slow compared to pandas

import timeit
import numpy as np
from pandas import DataFrame
from xray import Dataset, DataArray

df = DataFrame({"a": np.r_[np.arange(500.), np.arange(500.)],
                "b": np.arange(1000.)})
print(timeit.repeat('df.groupby("a").agg("mean")', globals={"df": df}, number=10))
print(timeit.repeat('df.groupby("a").agg(np.mean)', globals={"df": df, "np": np}, number=10))

ds = Dataset({"a": DataArray(np.r_[np.arange(500.), np.arange(500.)]),
              "b": DataArray(np.arange(1000.))})
print(timeit.repeat('ds.groupby("a").mean()', globals={"ds": ds}, number=10))

This outputs

[0.010462284000823274, 0.009770361997652799, 0.01081446700845845]
[0.02622630601399578, 0.024328112005605362, 0.018717073995503597]
[2.2804569930012804, 2.1666158599982737, 2.2688316510029836]

i.e. xray’s groupby is ~100 times slower than pandas’ groupby with np.mean, and ~200 times slower than passing "mean" to pandas’ groupby (which I assume involves some specialization).

(This is the actual order of magnitude of the data size and redundancy I want to handle, i.e. thousands of points with very limited duplication.)
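
As a rough illustration of why a specialized grouped mean can be so much faster than looping over groups, here is a minimal NumPy-only sketch. It is not pandas' or xray's actual implementation; `grouped_mean` is a hypothetical helper that maps each key to an integer code and then computes per-group sums and counts in vectorized passes.

```python
import numpy as np


def grouped_mean(keys, values):
    """Mean of `values` for each distinct entry of `keys` (hypothetical helper)."""
    # Map every key to an integer group code in 0..n_groups-1.
    unique_keys, codes = np.unique(keys, return_inverse=True)
    # Per-group sums and counts, each computed in a single vectorized pass.
    sums = np.bincount(codes, weights=values, minlength=unique_keys.size)
    counts = np.bincount(codes, minlength=unique_keys.size)
    return unique_keys, sums / counts


# Same data layout as in the benchmark above: each key appears exactly twice.
a = np.r_[np.arange(500.0), np.arange(500.0)]
b = np.arange(1000.0)
group_keys, means = grouped_mean(a, b)
```

With many small groups, as here, the cost of this approach depends on the number of points rather than on the number of groups, which is where a per-group Python loop tends to lose.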

Issue Analytics

Created 8 years ago
Comments: 9 (7 by maintainers)

Top GitHub Comments

9 reactions

jjpr-mit commented, Oct 4, 2017

In case anyone gets here by Googling something like “xarray groupby slow” and you loaded data from a netCDF file, be aware that slowness you see in groupby aggregation on a Dataset or DataArray may actually be due not to this issue but to the lazy loading that’s done by default. This can be fixed by calling .load() on the Dataset or DataArray. See the Tip about lazy loading at http://xarray.pydata.org/en/stable/io.html#netcdf.
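
A minimal sketch of the .load() approach described above, assuming a recent xarray with a netCDF backend (such as scipy or netCDF4) installed. The file name `example.nc` and the dataset contents are made up for illustration.

```python
import os
import tempfile

import numpy as np
import xarray as xr

# Build a small dataset shaped like the one in the issue and write it to disk.
ds = xr.Dataset({
    "a": ("x", np.r_[np.arange(500.0), np.arange(500.0)]),
    "b": ("x", np.arange(1000.0)),
})
path = os.path.join(tempfile.mkdtemp(), "example.nc")  # hypothetical file name
ds.to_netcdf(path)

# open_dataset reads lazily: values are only fetched from the file on access.
with xr.open_dataset(path) as lazy:
    in_memory = lazy.load()  # pull all variables into memory as NumPy arrays

# The groupby now operates on in-memory data rather than on the file.
result = in_memory.groupby("a").mean()
```

Timing the groupby with and without the .load() call is a quick way to check whether lazy loading is the bottleneck in a given case.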

4 reactions

andersy005 commented, May 13, 2022

#5734 has greatly improved the performance. Fantastic work @dcherian 👏🏽

In [13]: import xarray as xr, pandas as pd, numpy as np

In [14]: ds = xr.Dataset({"a": xr.DataArray(np.r_[np.arange(500.), np.arange(500.)]),
    ...:               "b": xr.DataArray(np.arange(1000.))})

In [15]: ds
Out[15]: 
<xarray.Dataset>
Dimensions:  (dim_0: 1000)
Dimensions without coordinates: dim_0
Data variables:
    a        (dim_0) float64 0.0 1.0 2.0 3.0 4.0 ... 496.0 497.0 498.0 499.0
    b        (dim_0) float64 0.0 1.0 2.0 3.0 4.0 ... 996.0 997.0 998.0 999.0

In [16]: xr.set_options(use_flox=True)
Out[16]: <xarray.core.options.set_options at 0x104de21a0>

In [17]: %%timeit
    ...: ds.groupby("a").mean()
    ...: 
    ...: 
1.5 ms ± 3.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [18]: xr.set_options(use_flox=False)
Out[18]: <xarray.core.options.set_options at 0x144382350>

In [19]: %%timeit
    ...: ds.groupby("a").mean()
    ...: 
    ...: 
94 ms ± 715 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
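
The timings above correspond to roughly a 60x speedup (94 ms vs. 1.5 ms). As a sketch, assuming an xarray version that supports the `use_flox` option (and, for the fast path, that the flox package is installed), `set_options` can also be used as a context manager to scope the setting and confirm that both code paths produce the same result:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({
    "a": ("x", np.r_[np.arange(500.0), np.arange(500.0)]),
    "b": ("x", np.arange(1000.0)),
})

# Used as a context manager, set_options only applies inside the block.
with xr.set_options(use_flox=True):
    with_flox = ds.groupby("a").mean()  # uses flox if it is installed
with xr.set_options(use_flox=False):
    without_flox = ds.groupby("a").mean()  # xarray's built-in groupby path

# Both paths should agree on the grouped means.
np.testing.assert_allclose(with_flox["b"].values, without_flox["b"].values)
```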