groupby very slow compared to pandas
See original GitHub issue

```python
import timeit

import numpy as np
from pandas import DataFrame
from xray import Dataset, DataArray

df = DataFrame({"a": np.r_[np.arange(500.), np.arange(500.)],
                "b": np.arange(1000.)})
print(timeit.repeat('df.groupby("a").agg("mean")', globals={"df": df}, number=10))
print(timeit.repeat('df.groupby("a").agg(np.mean)', globals={"df": df, "np": np}, number=10))

ds = Dataset({"a": DataArray(np.r_[np.arange(500.), np.arange(500.)]),
              "b": DataArray(np.arange(1000.))})
print(timeit.repeat('ds.groupby("a").mean()', globals={"ds": ds}, number=10))
```
This outputs:

```
[0.010462284000823274, 0.009770361997652799, 0.01081446700845845]
[0.02622630601399578, 0.024328112005605362, 0.018717073995503597]
[2.2804569930012804, 2.1666158599982737, 2.2688316510029836]
```
i.e. xray’s groupby is ~100 times slower than pandas’ (and ~200 times slower than passing "mean" to pandas’ groupby, which I assume involves some specialization).

(This is the actual order of magnitude of the data size and redundancy I want to handle, i.e. thousands of points with very limited duplication.)
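A minimal sketch of one possible workaround (my own illustration, not something proposed in the issue): for data of this size that fits comfortably in memory, the reduction can be done in pandas and the result converted back, which sidesteps xray’s per-group overhead.

```python
import numpy as np
from xray import Dataset, DataArray  # the project was later renamed "xarray"

ds = Dataset({"a": DataArray(np.r_[np.arange(500.), np.arange(500.)]),
              "b": DataArray(np.arange(1000.))})

# Round-trip through pandas: do the fast groupby there, then convert back.
df = ds.to_dataframe()
mean_per_group = df.groupby("a").mean()
result = Dataset.from_dataframe(mean_per_group)
```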
Issue Analytics
- State:
- Created 8 years ago
- Comments: 9 (7 by maintainers)
Top Results From Across the Web

python - Pandas: df.groupby() is too slow for big data set. Any ...
The problem is that your data are not numeric. Processing strings takes a lot longer than processing numbers. Try this first: ...

DataFrame groupby is extremely slow when grouping by a ...
When a DataFrame column contains pandas.Period values, and the user attempts to groupby this column, the resulting operation is very, very slow, ...

Why pandas apply method is slow, and how Terality ...
Let's compare two code snippets computing the sum of the squares of dataframe columns. ... (a minimal sketch of such a comparison follows this list)

pandas.core.groupby.GroupBy.apply
While apply is a very flexible method, its downside is that using it can be quite a bit slower than using more specific ...

[Solved]-Pandas groupby apply performing slow-Pandas,Python
Accepted answer. The problem, I believe, is that your data has 5300 distinct groups. Due to this, anything slow within your function will ...
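The "apply is slow" results above all point at the same cause: a Python-level function call per row or per group. A minimal sketch of the "sum of the squares of dataframe columns" comparison (my own code, not taken from the linked article):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(100_000, 2), columns=["x", "y"])

# Row-wise apply: one Python function call per row.
slow = df.apply(lambda row: row["x"] ** 2 + row["y"] ** 2, axis=1)

# Vectorized: a handful of operations on whole columns.
fast = df["x"] ** 2 + df["y"] ** 2

assert np.allclose(slow, fast)  # same result, very different runtime
```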
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
In case anyone gets here by Googling something like “xarray groupby slow” and you loaded data from a netCDF file, be aware that the slowness you see in groupby aggregation on a `Dataset` or `DataArray` may actually be due not to this issue but to the lazy loading that’s done by default. This can be fixed by calling `.load()` on the `Dataset` or `DataArray`. See the Tip about lazy loading at http://xarray.pydata.org/en/stable/io.html#netcdf.

#5734 has greatly improved the performance. Fantastic work @dcherian 👏🏽
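A short sketch of the tip from the comment above, with a placeholder file name and grouping key (both are assumptions for illustration):

```python
import xarray as xr

ds = xr.open_dataset("data.nc")  # variables are lazily loaded by default
ds = ds.load()                   # read the data into memory up front
monthly_mean = ds.groupby("time.month").mean()  # groupby now runs on in-memory arrays
```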