groupby very slow compared to pandas

import timeit
import numpy as np
from pandas import DataFrame
from xray import Dataset, DataArray  # xray was later renamed to xarray

# 1000 rows forming 500 groups of two rows each.
df = DataFrame({"a": np.r_[np.arange(500.), np.arange(500.)],
                "b": np.arange(1000.)})
# pandas: aggregate by name, then by passing the NumPy function.
print(timeit.repeat('df.groupby("a").agg("mean")', globals={"df": df}, number=10))
print(timeit.repeat('df.groupby("a").agg(np.mean)', globals={"df": df, "np": np}, number=10))

# The same data as an xray Dataset.
ds = Dataset({"a": DataArray(np.r_[np.arange(500.), np.arange(500.)]),
              "b": DataArray(np.arange(1000.))})
print(timeit.repeat('ds.groupby("a").mean()', globals={"ds": ds}, number=10))

This outputs

[0.010462284000823274, 0.009770361997652799, 0.01081446700845845]
[0.02622630601399578, 0.024328112005605362, 0.018717073995503597]
[2.2804569930012804, 2.1666158599982737, 2.2688316510029836]

i.e. xray’s groupby is ~100 times slower than pandas’ groupby with np.mean (and ~200 times slower than passing "mean" to pandas’ groupby, which I assume involves some specialization).
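As a quick check of those ratios, here is a small sketch that recomputes them from the timings printed above. It takes the minimum of each run, which the timeit documentation suggests is the most meaningful figure; the variable names are only illustrative.

```python
# Timings copied from the output above (seconds per 10 calls, 3 repeats each).
pandas_str = [0.010462284000823274, 0.009770361997652799, 0.01081446700845845]
pandas_np = [0.02622630601399578, 0.024328112005605362, 0.018717073995503597]
xray_times = [2.2804569930012804, 2.1666158599982737, 2.2688316510029836]

# Compare the fastest repeat of each variant.
ratio_vs_np = min(xray_times) / min(pandas_np)
ratio_vs_str = min(xray_times) / min(pandas_str)

print(f"xray vs pandas agg(np.mean): {ratio_vs_np:.0f}x slower")
print(f'xray vs pandas agg("mean"):  {ratio_vs_str:.0f}x slower')
```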

(This is the actual order of magnitude of the data size and redundancy I want to handle, i.e. thousands of points with very limited duplication.)

Issue Analytics

  • State: closed
  • Created 8 years ago
  • Comments: 9 (7 by maintainers)

Top GitHub Comments

9 reactions
jjpr-mit commented, Oct 4, 2017

In case anyone gets here by Googling something like “xarray groupby slow”: if you loaded your data from a netCDF file, the slowness you see in groupby aggregation on a Dataset or DataArray may be caused not by this issue but by the lazy loading that is done by default. This can be fixed by calling .load() on the Dataset or DataArray before grouping. See the Tip about lazy loading at http://xarray.pydata.org/en/stable/io.html#netcdf.
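A minimal sketch of that advice, assuming a netCDF backend (netCDF4, h5netcdf, or scipy) is installed. The file name example.nc and the dimension name x are made up for illustration and do not come from the issue.

```python
import os
import tempfile

import numpy as np
import xarray as xr

# Hypothetical data mirroring the issue: 1000 points, each group value seen twice.
ds = xr.Dataset({
    "a": ("x", np.r_[np.arange(500.), np.arange(500.)]),
    "b": ("x", np.arange(1000.)),
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.nc")  # illustrative file name
    ds.to_netcdf(path)

    # open_dataset reads lazily: values are fetched from disk on access.
    with xr.open_dataset(path) as lazy:
        # .load() pulls every variable into memory once, so the groupby
        # below works on in-memory NumPy arrays.
        in_memory = lazy.load()

result = in_memory.groupby("a").mean()
print(result["b"].values[:3])
```

Whether this helps depends on the dataset fitting in memory; for larger files, loading only the variables that take part in the groupby may be the more practical choice.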

4 reactions
andersy005 commented, May 13, 2022

#5734 has greatly improved the performance. Fantastic work @dcherian 👏🏽

In [13]: import xarray as xr, pandas as pd, numpy as np

In [14]: ds = xr.Dataset({"a": xr.DataArray(np.r_[np.arange(500.), np.arange(500.)]),
    ...:               "b": xr.DataArray(np.arange(1000.))})

In [15]: ds
Out[15]: 
<xarray.Dataset>
Dimensions:  (dim_0: 1000)
Dimensions without coordinates: dim_0
Data variables:
    a        (dim_0) float64 0.0 1.0 2.0 3.0 4.0 ... 496.0 497.0 498.0 499.0
    b        (dim_0) float64 0.0 1.0 2.0 3.0 4.0 ... 996.0 997.0 998.0 999.0
In [16]: xr.set_options(use_flox=True)
Out[16]: <xarray.core.options.set_options at 0x104de21a0>

In [17]: %%timeit
    ...: ds.groupby("a").mean()
    ...: 
    ...: 
1.5 ms ± 3.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [18]: xr.set_options(use_flox=False)
Out[18]: <xarray.core.options.set_options at 0x144382350>

In [19]: %%timeit
    ...: ds.groupby("a").mean()
    ...: 
    ...: 
94 ms ± 715 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
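The comparison above can also be run as a plain script. The sketch below assumes an xarray version recent enough to accept the use_flox option; as far as I understand, the option only has an effect when the flox package is installed, and xarray otherwise uses its built-in groupby path. The helper grouped_mean is a name invented for this example.

```python
import timeit

import numpy as np
import xarray as xr

ds = xr.Dataset({
    "a": xr.DataArray(np.r_[np.arange(500.), np.arange(500.)]),
    "b": xr.DataArray(np.arange(1000.)),
})


def grouped_mean(use_flox):
    # set_options works as a context manager, so the setting is restored on exit.
    with xr.set_options(use_flox=use_flox):
        return ds.groupby("a").mean()


with_flox = grouped_mean(True)
without_flox = grouped_mean(False)

# Both code paths should produce the same group means.
same = np.allclose(with_flox["b"].values, without_flox["b"].values)
print("results match:", same)

for flag in (True, False):
    seconds = timeit.timeit(lambda: grouped_mean(flag), number=5) / 5
    print(f"use_flox={flag}: {seconds * 1e3:.2f} ms per call")
```

Using the context-manager form keeps the option from leaking into the rest of a session, which the two separate set_options calls in the IPython transcript do not do.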

