Enabling chained assignment checks (SettingWithCopyWarning) can have huge performance impact
Similar to an observation on reddit, I noticed that there is a huge performance difference between the default `pd.options.mode.chained_assignment = 'warn'` and setting it to `None`.
Code Sample
import time
import pandas as pd
import numpy as np


def gen_data(N=10000):
    df = pd.DataFrame(index=range(N))
    for c in range(10):
        df[str(c)] = np.random.uniform(size=N)
    df["id"] = np.random.choice(range(500), size=len(df))
    return df


def do_something_on_df(df):
    """Dummy computation that contains inplace mutations"""
    for c in range(df.shape[1]):
        df[str(c)] = np.random.uniform(size=df.shape[0])
    return 42


def run_test(mode="warn"):
    pd.options.mode.chained_assignment = mode
    df = gen_data()
    t1 = time.time()
    for key, group_df in df.groupby("id"):
        do_something_on_df(group_df)
    t2 = time.time()
    print("Runtime: {:10.3f} sec".format(t2 - t1))


if __name__ == "__main__":
    run_test(mode="warn")
    run_test(mode=None)
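For comparison, making the per-group mutation operate on an explicit copy sidesteps the check entirely, since a freshly copied DataFrame is unambiguously not a view of the parent. A minimal sketch (the helper name `do_something_on_copy` is my variant, not part of the original report):

```python
import numpy as np
import pandas as pd


def do_something_on_copy(df):
    # An explicit copy owns its data, so pandas never needs to run the
    # chained-assignment check on assignment into it.
    df = df.copy()
    for c in range(df.shape[1]):
        df[str(c)] = np.random.uniform(size=df.shape[0])
    return 42


df = pd.DataFrame({"0": np.random.uniform(size=100),
                   "id": np.random.choice(range(5), size=100)})
for _, group_df in df.groupby("id"):
    assert do_something_on_copy(group_df) == 42
```

The trade-off is one extra copy per group, which is usually far cheaper than the per-assignment check in `'warn'` mode.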
Problem description
The run times vary a lot depending on whether the SettingWithCopyWarning is enabled or disabled. I tried with a few different Pandas/Python versions:
Debian VM, Python 3.6.2, pandas 0.21.0
Runtime: 46.693 sec
Runtime: 0.731 sec
Debian VM, Python 2.7.9, pandas 0.20.0
Runtime: 101.204 sec
Runtime: 0.622 sec
Ubuntu (host), Python 2.7.3, pandas 0.21.0
Runtime: 35.363 sec
Runtime: 0.517 sec
Ideally, there should not be such a big penalty for SettingWithCopyWarning.
From profiling results it looks like the reason might be this call to gc.collect.
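The gc.collect hypothesis is easy to sanity-check outside pandas: a full collection has to traverse every tracked container object, so its cost grows with the number of live objects in the process. A minimal micro-benchmark (the helper name `time_collect` is mine, not from the issue):

```python
import gc
import time


def time_collect(n_objects):
    # Allocate n_objects tracked containers, then time one full collection.
    objs = [{"i": i} for i in range(n_objects)]
    t0 = time.time()
    gc.collect()
    elapsed = time.time() - t0
    del objs
    return elapsed


small = time_collect(1_000)
large = time_collect(1_000_000)
print("collect with 1k objects:  {:.6f} s".format(small))
print("collect with 1M objects:  {:.6f} s".format(large))
```

This is why a `gc.collect()` per chained-assignment check becomes so expensive in sessions holding many objects, such as the groupby loop above.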
Output of pd.show_versions()
pandas: 0.21.0
pytest: None
pip: 9.0.1
setuptools: 28.8.0
Cython: None
numpy: 1.13.3
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
Issue Analytics
- Created 6 years ago
- Comments: 7 (6 by maintainers)
Top GitHub Comments
It would probably be helpful to document the performance impact more clearly. This can have subtle side effects which are very hard to find. I only noticed it because a Dask/Distributed computation was much slower than expected (use case documented on SO).
Of course, this has to run the garbage collector. You can certainly just disable them. This won't be fixed in pandas 2.
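If the check is only a problem around a known-hot loop, it can be disabled locally rather than for the whole session. A sketch using `pd.option_context`, which restores the previous setting on exit (note that each group yielded by `groupby` is already a separate object, so the parent frame is not modified here):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(6), "id": [0, 0, 1, 1, 2, 2]})

# Disable the chained-assignment check only inside this block; the
# previous mode is restored automatically when the block exits.
with pd.option_context("mode.chained_assignment", None):
    for _, group_df in df.groupby("id"):
        group_df["a"] = group_df["a"] * 2  # mutates the group object only

# Outside the block the prior mode is back in effect, and the parent
# frame is untouched.
print(df["a"].tolist())
```

This keeps the safety net in place for the rest of the program while avoiding the per-assignment cost in the loop.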