Performance regression in 0.24+ on GroupBy.apply
Code Sample
import time
import numpy as np
import pandas as pd
nrows, ncols = 1000, 100
# data frame with random values and a key to be grouped by
df = pd.DataFrame(np.random.rand(nrows, ncols))
df["key"] = range(nrows)
numeric_columns = list(range(ncols))
grouping = df.groupby(by="key")
# performance regression in apply()
start = time.time()
grouping[numeric_columns].apply(lambda x: x - x.mean())
end = time.time()
print("[pandas=={}] execution time: {:.4f} seconds".format(pd.__version__, end - start))
# [pandas==0.23.4] execution time: 0.8700 seconds
# [pandas==0.24.0] execution time: 24.3790 seconds
# [pandas==0.24.2] execution time: 23.9600 seconds
Problem description
The function GroupBy.apply is roughly 25 times slower in version 0.24.0 than in the 0.23.4 release, and the problem persists in the latest 0.24.2 release.
The code sample above demonstrates the regression. Its purpose is to subtract the group mean from all elements in each group.
The problem only occurs when the lambda passed to apply() returns a DataFrame.
There are no performance issues with scalar return values, e.g. lambda x: x.mean().
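The scalar-vs-DataFrame distinction can be checked directly. The sketch below (an assumption based on the report, using the same synthetic data as the sample above) runs both variants: the scalar-returning lambda is unaffected by the regression, while the DataFrame-returning one hits the slow path in 0.24.x.

```python
import numpy as np
import pandas as pd

nrows, ncols = 1000, 100
df = pd.DataFrame(np.random.rand(nrows, ncols))
df["key"] = range(nrows)
numeric_columns = list(range(ncols))
grouping = df.groupby(by="key")

# Scalar/Series return value per group: not affected by the regression
scalar_result = grouping[numeric_columns].apply(lambda x: x.mean())

# DataFrame return value per group: the case reported as ~25x slower in 0.24.x
frame_result = grouping[numeric_columns].apply(lambda x: x - x.mean())
```

With one row per key, each group's mean is the row itself, so `scalar_result` reproduces the input values and `frame_result` is all zeros; only the DataFrame-returning call exhibits the slowdown.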
Output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.24.0
pytest: None
pip: 10.0.1
setuptools: 39.2.0
Cython: None
numpy: 1.16.2
scipy: 1.2.1
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
Issue Analytics
- Created: 4 years ago
- Reactions: 1
- Comments: 11 (7 by maintainers)
FWIW, on a desktop computer the benchmark numbers are fairly stable. If the accuracy is not sufficient, you can add -a processes=5 to run 5 rounds (instead of the default 2) and get a better sample of the fluctuations.

Note that if this is your actual function, you can/should instead do this, which has always been faster.
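The code the commenter refers to was not captured here. A common faster equivalent for demeaning within groups (a sketch, not necessarily the commenter's exact suggestion) avoids the DataFrame-returning apply() entirely by broadcasting the group means with transform():

```python
import numpy as np
import pandas as pd

nrows, ncols = 1000, 100
df = pd.DataFrame(np.random.rand(nrows, ncols))
df["key"] = range(nrows)
numeric_columns = list(range(ncols))
grouping = df.groupby(by="key")

# Slow path: apply() with a DataFrame return value per group
via_apply = grouping[numeric_columns].apply(lambda x: x - x.mean())

# Faster: transform("mean") broadcasts each group's mean back to the
# original row positions, so the subtraction is a single vectorized op
via_transform = df[numeric_columns] - grouping[numeric_columns].transform("mean")
```

Both produce the same values; transform() stays on pandas' fast cythonized aggregation path instead of calling the Python lambda once per group.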