1.3.0 PerformanceWarning: DataFrame is highly fragmented.
See original GitHub issue

- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- (optional) I have confirmed this bug exists on the master branch of pandas.
Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.
Code Sample, a copy-pastable example
Minimal sample
```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': np.random.randint(0, 100, size=55),
                   'b': np.random.randint(0, 100, size=55)})

# Assign 100 new columns to the dataframe, one at a time
for i in range(0, 100):
    df.loc[:, f'n_{i}'] = np.random.randint(0, 100, size=55)
    # Alternative assignment - already triggers PerformanceWarnings here:
    # df[f'n_{i}'] = np.random.randint(0, 100, size=55)

df1 = df.copy()
# Triggers the performance warning again
df1['c'] = np.random.randint(0, 100, size=55)

# Visualize blocks (_data is the deprecated alias of the _mgr internals)
print(df._data.nblocks)
print(df1._data.nblocks)
```
Problem description
Since pandas 1.3.0, the above minimal sample produces a PerformanceWarning.
While I think I understand the warning, I don't understand how to mitigate it: the docs don't contain any help I could find for this, and the proposed solution (copy()) does not seem to work.
PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider using pd.concat instead. To get a de-fragmented frame, use `newframe = frame.copy()`
While this certainly isn't an ideal scenario (assigning single columns one after the other), I also don't see how it can be changed in our use case.
The proposed df.copy() does not mitigate the warning, and the block count remains the same.
Based on my understanding, df.loc[:, 'colname'] = ... is the recommended way to assign new columns. This creates a new block for every insert, and df.copy() (which is proposed in the warning) does not consolidate the blocks into one block, which means the warning can't really be mitigated.
Strangely enough, the behaviour of df['colname'] = ... and df.loc[:, 'colname'] = ... is not identical: the first triggers the PerformanceWarning, while the second does not (although the problem is still there in the background).
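A small sketch (not part of the original report) that makes this asymmetry visible by recording warnings for both assignment styles; the helper name and sizes are only illustrative:

```python
import warnings
import numpy as np
import pandas as pd

def fragment(use_loc):
    """Insert 150 columns one by one; report (nblocks, PerformanceWarnings seen)."""
    df = pd.DataFrame({'a': np.random.randint(0, 100, size=55)})
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter('always', pd.errors.PerformanceWarning)
        for i in range(150):
            vals = np.random.randint(0, 100, size=55)
            if use_loc:
                df.loc[:, f'n_{i}'] = vals   # no warning on 1.3.0
            else:
                df[f'n_{i}'] = vals          # warns once the frame is fragmented
    n_perf = sum(issubclass(w.category, pd.errors.PerformanceWarning) for w in caught)
    return df._mgr.nblocks, n_perf

print(fragment(use_loc=False))  # fragmented, PerformanceWarnings recorded
print(fragment(use_loc=True))   # equally fragmented, but silent on 1.3.0
```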
So this leaves me with a few questions:
- How should the above scenario correctly handle inserts to keep performance and avoid this warning?
- How can the dataframe be effectively consolidated (the proposed frame.copy() from the warning does not do that)?
Expected Output
Output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit : f00ed8f47020034e752baf0250483053340971b0
python : 3.9.2.final.0
python-bits : 64
OS : Linux
OS-release : 5.12.11-arch1-1
Version : #1 SMP PREEMPT Wed, 16 Jun 2021 15:25:28 +0000
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.utf8
LOCALE : en_US.UTF-8
pandas : 1.3.0
numpy : 1.21.0
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1.3
setuptools : 57.0.0
Cython : None
pytest : 6.2.4
hypothesis : None
sphinx : None
blosc : 1.10.4
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : 1.0.2
psycopg2 : 2.8.6 (dt dec pq3 ext lo64)
jinja2 : 3.0.1
IPython : 7.21.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.4.1
numexpr : 2.7.3
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.7.0
sqlalchemy : 1.4.20
tables : 3.6.1
tabulate : 0.8.9
xarray : None
xlrd : None
xlwt : None
numba : None
Issue Analytics
- Created 2 years ago
- Reactions: 11
- Comments: 5 (3 by maintainers)
Here is a concrete example of how much faster concat can be if used properly - in keeping with the sample above:

before:
28.6 ms ± 586 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

after:
2.33 ms ± 92.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
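The code that produced these numbers is not shown in the thread, so the following is only a sketch of the kind of before/after comparison being described, reusing the column sizes from the sample above:

```python
import numpy as np
import pandas as pd

def insert_one_by_one():
    # "before": grow the frame column by column, one insert at a time
    df = pd.DataFrame({'a': np.random.randint(0, 100, size=55),
                       'b': np.random.randint(0, 100, size=55)})
    for i in range(100):
        df[f'n_{i}'] = np.random.randint(0, 100, size=55)
    return df

def concat_all_at_once():
    # "after": build all new columns up front and concatenate once
    df = pd.DataFrame({'a': np.random.randint(0, 100, size=55),
                       'b': np.random.randint(0, 100, size=55)})
    new_cols = pd.DataFrame({f'n_{i}': np.random.randint(0, 100, size=55)
                             for i in range(100)})
    return pd.concat([df, new_cols], axis=1)

# e.g. in IPython:
# %timeit insert_one_by_one()
# %timeit concat_all_at_once()
```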
Another example of something that might not be immediately intuitive, but makes sense when you think about it (obviously for x**2 you could use pandas' vectorized methods, which would be even faster; this is just to show the speed difference between apply and a list comprehension, since not all functions you want to use in apply have pandas built-in equivalents):

before:
17 ms ± 628 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

after:
5.99 ms ± 250 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
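Again, the measured code is not included in the comment, so this is only a guess at the shape of the comparison (the Series length and the x**2 function are assumptions):

```python
import numpy as np
import pandas as pd

s = pd.Series(np.random.randint(0, 100, size=100_000))

def with_apply():
    # "before": per-element call through Series.apply
    return s.apply(lambda x: x ** 2)

def with_list_comprehension():
    # "after": the same Python-level function inside a list comprehension
    return pd.Series([x ** 2 for x in s], index=s.index)

# The fully vectorized equivalent mentioned above would simply be: s ** 2
```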
Yes, this was intentional.

This is a bug that should be fixed.

If the .copy bug is fixed, then you should be fine if you do all your inserts and then call .copy(). A better option would be to use pd.concat to do it all at once.
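A minimal sketch of that "insert everything, then consolidate once" pattern, assuming a pandas version where the copy()-does-not-consolidate regression discussed in this issue is fixed; the concat-all-at-once alternative is sketched a couple of comments above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randint(0, 100, size=55),
                   'b': np.random.randint(0, 100, size=55)})

# Do all the inserts first ...
for i in range(100):
    df[f'n_{i}'] = np.random.randint(0, 100, size=55)

# ... then pay for consolidation a single time at the end.
df = df.copy()
print(df._mgr.nblocks)  # should drop back to 1 once copy() consolidates again
```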