
PERF: groupby performance regression in 1.2.x

See original GitHub issue
  • [x] I have checked that this issue has not already been reported.

  • [x] I have confirmed this bug exists on the latest version of pandas.

  • [ ] (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import numpy as np
import pandas as pd
pd.__version__

# 1M rows of small integers; cast once to object dtype (str) and once to the
# nullable string dtype
cols = list('abcdefghjkl')
df = pd.DataFrame(np.random.randint(0, 100, size=(1_000_000, len(cols))), columns=cols)
df_str = df.astype(str)          # object dtype columns
df_string = df.astype('string')  # StringDtype columns

# IPython: time the same aggregation on both frames
%timeit df_str.groupby('a')[cols[1:]].agg('last')
%timeit df_string.groupby('a')[cols[1:]].agg('last')

Problem description

Pandas 1.2.x is much slower (about 9x) than 1.1.5 in the groupby aggregation above when the columns are of string dtype. When the columns are of object dtype, performance is comparable across the two pandas versions.
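
For context, the only difference between the two frames is the column dtype; a quick check (illustrative, not part of the original report) makes the distinction visible:

df_str['b'].dtype     # dtype('O') -- plain object columns holding Python str
df_string['b'].dtype  # StringDtype -- the nullable string extension dtype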

Expected Output

In pandas 1.1.5 this groupby-aggregation is a bit faster with string dtype than with object dtype:

%timeit df_str.groupby('a')[cols[1:]].agg('last')
680 ms ± 3.86 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df_string.groupby('a')[cols[1:]].agg('last')
544 ms ± 3.88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Conversely, in pandas 1.2.4 the same groupby-aggregation is 7x slower with string dtype than with object dtype:

%timeit df_str.groupby('a')[cols[1:]].agg('last')
700 ms ± 7.54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df_string.groupby('a')[cols[1:]].agg('last')
4.93 s ± 104 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I would expect comparable performance between pandas 1.1.5 and 1.2.4; instead, there is a large performance regression in 1.2.4 when performing the groupby aggregation with string dtype.
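
Until the regression is fixed, one possible workaround (my own sketch, not suggested in the issue) is to cast the string columns to object for the aggregation and restore the dtype afterwards:

# workaround sketch: aggregate on object dtype, then cast back to 'string'
result = (
    df_string.astype(object)
             .groupby('a')[cols[1:]]
             .agg('last')
             .astype('string')
)

This trades two full-column casts for the fast object-dtype groupby path, which should be a net win at the ~7x gap measured above, though it temporarily holds a second copy of the frame in memory.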

Output of pd.show_versions()

INSTALLED VERSIONS

commit           : b5958ee1999e9aead1938c0bba2b674378807b3d
python           : 3.8.10.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.11.0-17-generic
Version          : #18-Ubuntu SMP Thu May 6 20:10:11 UTC 2021
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.1.5
numpy            : 1.20.3
pytz             : 2021.1
dateutil         : 2.8.1
pip              : 21.1.1
setuptools       : 49.6.0.post20210108
Cython           : 0.29.23
pytest           : 6.2.4
hypothesis       : None
sphinx           : 4.0.1
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.6.3
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 3.0.1
IPython          : 7.23.1
pandas_datareader: None
bs4              : 4.9.3
bottleneck       : 1.3.2
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : 3.4.2
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 4.0.0
pytables         : None
pyxlsb           : None
s3fs             : None
scipy            : 1.6.3
sqlalchemy       : 1.3.23
tables           : None
tabulate         : 0.8.9
xarray           : None
xlrd             : None
xlwt             : None
numba            : 0.53.1

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 6 (5 by maintainers)

Top GitHub Comments

1 reaction
simonjayhawkins commented, Sep 21, 2021

Looks like that got fixed…

%timeit df_str.groupby('a')[cols[1:]].agg('last')
# 769 ms ± 43.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) <-- master (1/6)
# 435 ms ± 19.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) <-- 1.2.4
# 417 ms ± 12.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) <-- 1.1.5

# 316 ms ± 3.96 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) <-- master (21/9)
# 325 ms ± 9.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) <-- 1.3.3
# 318 ms ± 5.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) <-- 1.3.2
# 321 ms ± 3.39 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) <-- 1.3.1
# 787 ms ± 46.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) <-- 1.3.0
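
Reading those timings, the slowdown appears in 1.2.x and 1.3.0 and is gone from 1.3.1 onward. A small guard along those lines (my inference from this thread, not an official changelog statement; packaging is a third-party dependency) could look like:

import pandas as pd
from packaging.version import Version

# Warn when running a pandas version the timings above suggest is affected
v = Version(pd.__version__)
if Version("1.2.0") <= v < Version("1.3.1"):
    print(f"pandas {pd.__version__}: string-dtype groupby may take the slow path")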
1 reaction
mzeitlin11 commented, May 21, 2021

Thanks for looking into this @tritemio! I think the issue may be that we’re now taking the slow (non-cython) path: the logic at https://github.com/pandas-dev/pandas/blob/751d500e96fc80e27b5c75eaf81f2852cb58f8b8/pandas/core/groupby/ops.py#L321 raises an error, so we fall back to the slower path. Something like last would work for object-type string data, so dispatch logic for StringDtype could be added, though there may be subtleties where that would cause other issues.
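
To illustrate the dispatch pattern that comment describes, here is a minimal, self-contained sketch; fast_last, agg_last, and the TypeError-based fallback are my own simplifications, not pandas internals:

import numpy as np

def fast_last(values, group_ids, ngroups):
    # Hypothetical fast path: reject object dtype, as the cython kernels do.
    if values.dtype == object:
        raise TypeError("object dtype not supported by the fast path")
    out = np.zeros(ngroups, dtype=values.dtype)
    # With repeated group ids, the last assignment per group wins,
    # which is exactly the 'last' aggregation.
    out[group_ids] = values
    return out

def agg_last(values, group_ids, ngroups):
    try:
        return fast_last(values, group_ids, ngroups)
    except TypeError:
        # Slow fallback: a per-row Python loop, roughly the kind of path
        # the StringDtype columns were being sent down.
        out = [None] * ngroups
        for gid, v in zip(group_ids, values):
            out[gid] = v
        return out

# e.g. agg_last(np.array([10, 20, 30]), np.array([0, 0, 1]), 2) -> array([20, 30])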

Read more comments on GitHub.
