PERF: groupby performance regression in 1.2.x
See original GitHub issue.

- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [ ] (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
import numpy as np
import pandas as pd
pd.__version__
cols = list('abcdefghjkl')
df = pd.DataFrame(np.random.randint(0, 100, size=(1_000_000, len(cols))), columns=cols)
df_str = df.astype(str)
df_string = df.astype('string')
%timeit df_str.groupby('a')[cols[1:]].agg('last')
%timeit df_string.groupby('a')[cols[1:]].agg('last')
Problem description
Pandas 1.2.x is much slower (about 9x) than 1.1.5 in the groupby aggregation above when the columns have string dtype. When the columns have object dtype, performance is comparable across the two pandas versions.
Expected Output
In pandas 1.1.5 this groupby aggregation is a bit faster with string dtype than with object dtype:
%timeit df_str.groupby('a')[cols[1:]].agg('last')
680 ms ± 3.86 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df_string.groupby('a')[cols[1:]].agg('last')
544 ms ± 3.88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Conversely, in pandas 1.2.4 the same groupby aggregation is about 7x slower with string dtype than with object dtype:
%timeit df_str.groupby('a')[cols[1:]].agg('last')
700 ms ± 7.54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df_string.groupby('a')[cols[1:]].agg('last')
4.93 s ± 104 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I would expect comparable performance between pandas 1.1.5 and 1.2.4; instead there is a large performance regression in 1.2.4 when performing the groupby aggregation with string dtype.
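As a possible interim workaround (my own suggestion, not something verified by the maintainers): since the object-dtype timings above are unaffected, the aggregation can be run on an object-cast copy and the result cast back to string dtype. A minimal sketch:

```python
import numpy as np
import pandas as pd

cols = list('abcdefghjkl')
df_string = pd.DataFrame(
    np.random.randint(0, 100, size=(1_000_000, len(cols))), columns=cols
).astype('string')

# Aggregate on object dtype (unaffected by the regression),
# then restore the 'string' dtype on the result.
result = (
    df_string.astype(object)
    .groupby('a')[cols[1:]]
    .agg('last')
    .astype('string')
)
```

The upfront astype(object) copy has its own cost, so this only helps when the aggregation itself dominates the runtime.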
Output of pd.show_versions()
INSTALLED VERSIONS
commit : b5958ee1999e9aead1938c0bba2b674378807b3d
python : 3.8.10.final.0
python-bits : 64
OS : Linux
OS-release : 5.11.0-17-generic
Version : #18-Ubuntu SMP Thu May 6 20:10:11 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.1.5
numpy : 1.20.3
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1.1
setuptools : 49.6.0.post20210108
Cython : 0.29.23
pytest : 6.2.4
hypothesis : None
sphinx : 4.0.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.3
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.23.1
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.4.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 4.0.0
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.6.3
sqlalchemy : 1.3.23
tables : None
tabulate : 0.8.9
xarray : None
xlrd : None
xlwt : None
numba : 0.53.1
looks like that got fixed…
Thanks for looking into this @tritemio! I think the issue may be that we're now taking the slow (non-cython) path because the logic here https://github.com/pandas-dev/pandas/blob/751d500e96fc80e27b5c75eaf81f2852cb58f8b8/pandas/core/groupby/ops.py#L321 raises an error, so we fall through to the slower fallback. Something like last would work for object-dtype string data, so some dispatch logic for StringDtype could be added, though there may be subtleties where that would cause other issues.
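To illustrate the idea with public API only (this is my own sketch, not the actual internal dispatch): the last aggregation is already fast for object-dtype string data, so one conceivable approach is to run it on the underlying object ndarray of a StringDtype column and re-wrap the result:

```python
import numpy as np
import pandas as pd

s = pd.Series(["x", "y", "z", "w"], dtype="string")
keys = np.array([0, 0, 1, 1])

# Run 'last' over the underlying object ndarray (the path that is
# fast for object-dtype string data in the timings above)...
obj_last = pd.Series(np.asarray(s, dtype=object)).groupby(keys).last()

# ...then restore the extension dtype on the result.
string_last = obj_last.astype("string")
print(string_last)
```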