Performance regression in 0.24+ on GroupBy.apply
Code Sample
import time
import numpy as np
import pandas as pd
nrows, ncols = 1000, 100
# data frame with random values and a key to be grouped by
df = pd.DataFrame(np.random.rand(nrows, ncols))
df["key"] = range(nrows)
numeric_columns = list(range(ncols))
grouping = df.groupby(by="key")
# performance regression in apply()
start = time.time()
grouping[numeric_columns].apply(lambda x: x - x.mean())
end = time.time()
print("[pandas=={}] execution time: {:.4f} seconds".format(pd.__version__, end - start))
# [pandas==0.23.4] execution time: 0.8700 seconds
# [pandas==0.24.0] execution time: 24.3790 seconds
# [pandas==0.24.2] execution time: 23.9600 seconds
Problem description
The function GroupBy.apply is roughly 25 times slower in version 0.24.0 than in the 0.23.4 release, and the problem persists in the latest 0.24.2 release.
The code sample above demonstrates the regression. Its purpose is to subtract the group mean from all elements in each group.
The problem only occurs when the lambda passed to apply() returns a DataFrame.
There are no performance issues with scalar return values, e.g. lambda x: x.mean().
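The scalar-vs-DataFrame distinction can be checked directly. The sketch below (an assumption based on the report, using the same synthetic data as the sample above) runs both variants: the scalar-returning lambda is unaffected by the regression, while the DataFrame-returning one hits the slow path in 0.24.x.

```python
import numpy as np
import pandas as pd

nrows, ncols = 1000, 100
df = pd.DataFrame(np.random.rand(nrows, ncols))
df["key"] = range(nrows)
numeric_columns = list(range(ncols))
grouping = df.groupby(by="key")

# Scalar/Series return value per group: not affected by the regression
scalar_result = grouping[numeric_columns].apply(lambda x: x.mean())

# DataFrame return value per group: the case reported as ~25x slower in 0.24.x
frame_result = grouping[numeric_columns].apply(lambda x: x - x.mean())
```

With one row per key, each group's mean is the row itself, so `scalar_result` reproduces the input values and `frame_result` is all zeros; only the DataFrame-returning call exhibits the slowdown.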
Output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.24.0
pytest: None
pip: 10.0.1
setuptools: 39.2.0
Cython: None
numpy: 1.16.2
scipy: 1.2.1
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
Issue Analytics
- Created: 4 years ago
- Reactions: 1
- Comments: 11 (7 by maintainers)
FWIW, on a desktop computer the benchmark numbers are fairly stable. If the accuracy is not sufficient, you can add -a processes=5 to run 5 rounds (instead of the default 2) and get a better sample of the fluctuations.

Note that if this is your actual function, you can/should instead do this, which has always been faster.
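The code the commenter refers to was not captured here. A common faster equivalent for demeaning within groups (a sketch, not necessarily the commenter's exact suggestion) avoids the DataFrame-returning apply() entirely by broadcasting the group means with transform():

```python
import numpy as np
import pandas as pd

nrows, ncols = 1000, 100
df = pd.DataFrame(np.random.rand(nrows, ncols))
df["key"] = range(nrows)
numeric_columns = list(range(ncols))
grouping = df.groupby(by="key")

# Slow path: apply() with a DataFrame return value per group
via_apply = grouping[numeric_columns].apply(lambda x: x - x.mean())

# Faster: transform("mean") broadcasts each group's mean back to the
# original row positions, so the subtraction is a single vectorized op
via_transform = df[numeric_columns] - grouping[numeric_columns].transform("mean")
```

Both produce the same values; transform() stays on pandas' fast cythonized aggregation path instead of calling the Python lambda once per group.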