PERF: groupby performance regression in 1.2.x
See original GitHub issue.

- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [ ] (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
import numpy as np
import pandas as pd
pd.__version__
cols = list('abcdefghjkl')
df = pd.DataFrame(np.random.randint(0, 100, size=(1_000_000, len(cols))), columns=cols)
df_str = df.astype(str)
df_string = df.astype('string')
%timeit df_str.groupby('a')[cols[1:]].agg('last')
%timeit df_string.groupby('a')[cols[1:]].agg('last')
Problem description
Pandas 1.2.x is much slower (about 9x) than 1.1.5 in the groupby aggregation above when the columns have string dtype. When the columns have object dtype, performance is comparable across the two pandas versions.
Expected Output
In pandas 1.1.5 this groupby aggregation is a bit faster with string dtype than with object dtype:
%timeit df_str.groupby('a')[cols[1:]].agg('last')
680 ms ± 3.86 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df_string.groupby('a')[cols[1:]].agg('last')
544 ms ± 3.88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Conversely, in pandas 1.2.4 the same groupby aggregation is about 7x slower with string dtype than with object dtype:
%timeit df_str.groupby('a')[cols[1:]].agg('last')
700 ms ± 7.54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df_string.groupby('a')[cols[1:]].agg('last')
4.93 s ± 104 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I would expect comparable performance between pandas 1.1.5 and 1.2.4; instead there is a large performance regression in 1.2.4 when performing the groupby aggregation with string dtype.
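As a possible interim workaround (my own suggestion, not something verified by the maintainers): since the object-dtype timings above are unaffected, the aggregation can be run on an object-cast copy and the result cast back to string dtype. A minimal sketch:

```python
import numpy as np
import pandas as pd

cols = list('abcdefghjkl')
df_string = pd.DataFrame(
    np.random.randint(0, 100, size=(1_000_000, len(cols))), columns=cols
).astype('string')

# Aggregate on object dtype (unaffected by the regression),
# then restore the 'string' dtype on the result.
result = (
    df_string.astype(object)
    .groupby('a')[cols[1:]]
    .agg('last')
    .astype('string')
)
```

The upfront astype(object) copy has its own cost, so this only helps when the aggregation itself dominates the runtime.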
Output of pd.show_versions()
INSTALLED VERSIONS
commit : b5958ee1999e9aead1938c0bba2b674378807b3d
python : 3.8.10.final.0
python-bits : 64
OS : Linux
OS-release : 5.11.0-17-generic
Version : #18-Ubuntu SMP Thu May 6 20:10:11 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.1.5
numpy : 1.20.3
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1.1
setuptools : 49.6.0.post20210108
Cython : 0.29.23
pytest : 6.2.4
hypothesis : None
sphinx : 4.0.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.3
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.23.1
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.4.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 4.0.0
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.6.3
sqlalchemy : 1.3.23
tables : None
tabulate : 0.8.9
xarray : None
xlrd : None
xlwt : None
numba : 0.53.1
looks like that got fixed…
Thanks for looking into this @tritemio! I think the issue may be that we're now taking the slow (non-cython) path because the logic here https://github.com/pandas-dev/pandas/blob/751d500e96fc80e27b5c75eaf81f2852cb58f8b8/pandas/core/groupby/ops.py#L321 raises an error, so we fall through to the slower fallback. Something like last would work for object-dtype string data, so some dispatch logic for StringDtype could be added, though there may be subtleties where that would cause other issues.
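To illustrate the idea with public API only (this is my own sketch, not the actual internal dispatch): the last aggregation is already fast for object-dtype string data, so one conceivable approach is to run it on the underlying object ndarray of a StringDtype column and re-wrap the result:

```python
import numpy as np
import pandas as pd

s = pd.Series(["x", "y", "z", "w"], dtype="string")
keys = np.array([0, 0, 1, 1])

# Run 'last' over the underlying object ndarray (the path that is
# fast for object-dtype string data in the timings above)...
obj_last = pd.Series(np.asarray(s, dtype=object)).groupby(keys).last()

# ...then restore the extension dtype on the result.
string_last = obj_last.astype("string")
print(string_last)
```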