
1.3.0 PerformanceWarning: DataFrame is highly fragmented.

See original GitHub issue
  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

Minimal sample

import pandas as pd
import numpy as np
df = pd.DataFrame({'a': np.random.randint(0, 100, size=55), 'b': np.random.randint(0, 100, size=55)})

# Assign 100 new columns to the dataframe, one at a time
for i in range(0, 100):
    df.loc[:, f'n_{i}'] = np.random.randint(0, 100, size=55)
    # Alternative assignment - already triggers a PerformanceWarning here.
    # df[f'n_{i}'] = np.random.randint(0, 100, size=55)

df1 = df.copy()
# Triggers performance warning again
df1['c'] = np.random.randint(0, 100, size=55)

# Inspect the internal block counts (._data / ._mgr are internal pandas attributes)
print(df._data.nblocks)
print(df1._data.nblocks)

Problem description

Since pandas 1.3.0, the above minimal sample produces a PerformanceWarning. While I think I understand the warning, I don’t understand how to mitigate it: the docs don’t contain any help I could find for this, and the proposed solution (copy()) does not seem to work.

PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider using pd.concat instead.  To get a de-fragmented frame, use `newframe = frame.copy()`

While this certainly isn’t an ideal scenario (assigning single columns one after the other), I also don’t see how it could be changed in our use case.

The proposed df.copy() does not mitigate the warning, and the block count remains the same. Based on my understanding, df.loc[:, 'colname'] = is the recommended way to assign new columns. This creates a new block for every insert, and df.copy() (the fix proposed in the warning) does not consolidate the blocks into one block, which means the warning can’t really be avoided.

Strangely enough, the behaviour of df['colname'] = and df.loc[:, 'colname'] = is not identical: the first triggers the PerformanceWarning while the second does not, although the fragmentation is still there in the background. A small check for this is sketched below.
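
A minimal sketch of how one might confirm that observation on pandas 1.3.0 (the column names c1 and c2 are illustrative; pd.errors.PerformanceWarning is the warning class the message refers to):

import warnings

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randint(0, 100, size=55),
                   'b': np.random.randint(0, 100, size=55)})

# Fragment the frame first so the internal >100-block threshold is exceeded.
for i in range(100):
    df.loc[:, f'n_{i}'] = np.random.randint(0, 100, size=55)

# Direct assignment - the style reported above to emit the warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    df['c1'] = np.random.randint(0, 100, size=55)
print('df[...] warned:', any(issubclass(w.category, pd.errors.PerformanceWarning) for w in caught))

# .loc assignment - reported above to stay silent, although the frame is just as fragmented.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    df.loc[:, 'c2'] = np.random.randint(0, 100, size=55)
print('df.loc[...] warned:', any(issubclass(w.category, pd.errors.PerformanceWarning) for w in caught))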

So this leaves me with a few questions:

  • How should the above scenario correctly handle inserts to keep performance and avoid this warning?
  • How can the dataframe be effectively consolidated? The frame.copy() proposed in the warning does not do that. (A minimal sketch follows below.)
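
A minimal sketch of what that consolidation question looks like in code, assuming pandas 1.3.0 as in the report. The internal ._data.nblocks attribute is used only to inspect block counts, and the exact numbers are implementation details; the dict-plus-single-concat variant shows one way to keep the frame consolidated from the start.

import numpy as np
import pandas as pd

# Build a fragmented frame exactly as in the reproduction above.
df = pd.DataFrame({'a': np.random.randint(0, 100, size=55),
                   'b': np.random.randint(0, 100, size=55)})
for i in range(100):
    df.loc[:, f'n_{i}'] = np.random.randint(0, 100, size=55)

print('fragmented:', df._data.nblocks, 'blocks')            # > 100 blocks
print('after copy():', df.copy()._data.nblocks, 'blocks')   # per the report, unchanged on 1.3.0

# Collect the new columns in a plain dict and concatenate once instead.
new_cols = {f'n_{i}': np.random.randint(0, 100, size=55) for i in range(100)}
consolidated = pd.concat(
    [pd.DataFrame({'a': np.random.randint(0, 100, size=55),
                   'b': np.random.randint(0, 100, size=55)}),
     pd.DataFrame(new_cols)],
    axis=1,
)
print('single concat:', consolidated._data.nblocks, 'blocks')  # a handful of blocks, not hundreds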

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : f00ed8f47020034e752baf0250483053340971b0
python           : 3.9.2.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.12.11-arch1-1
Version          : #1 SMP PREEMPT Wed, 16 Jun 2021 15:25:28 +0000
machine          : x86_64
processor        : 
byteorder        : little
LC_ALL           : None
LANG             : en_US.utf8
LOCALE           : en_US.UTF-8

pandas           : 1.3.0
numpy            : 1.21.0
pytz             : 2021.1
dateutil         : 2.8.1
pip              : 21.1.3
setuptools       : 57.0.0
Cython           : None
pytest           : 6.2.4
hypothesis       : None
sphinx           : None
blosc            : 1.10.4
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : 1.0.2
psycopg2         : 2.8.6 (dt dec pq3 ext lo64)
jinja2           : 3.0.1
IPython          : 7.21.0
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : 3.4.1
numexpr          : 2.7.3
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyxlsb           : None
s3fs             : None
scipy            : 1.7.0
sqlalchemy       : 1.4.20
tables           : 3.6.1
tabulate         : 0.8.9
xarray           : None
xlrd             : None
xlwt             : None
numba            : None

Issue Analytics

  • State:closed
  • Created 2 years ago
  • Reactions:11
  • Comments:5 (3 by maintainers)

Top GitHub Comments

3 reactions
Alex-ley commented, Feb 22, 2022

Here is a concrete example of how much faster concat can be if used properly, in keeping with the sample above:

Before: 28.6 ms ± 586 µs per loop (mean ± std. dev. of 7 runs, 100 loops each):

import pandas as pd
import numpy as np
import warnings
warnings.simplefilter(action='ignore', category=pd.errors.PerformanceWarning) # only so stdout/stderr fits on 1 page in Jupyter

df = pd.DataFrame({'a': np.random.randint(0, 100, size=55), 'b': np.random.randint(0, 100, size=55)})

# Assign > 100 new columns to the dataframe
for i in range(0, 100):
    # triggers PerformanceWarnings here already.
    df.loc[:, f'n_{i}'] = np.random.randint(0, 100, size=55)
    # Alternative assignment - also triggers PerformanceWarnings and same speed
    # df[f'n_{i}'] = np.random.randint(0, 100, size=55)

After: 2.33 ms ± 92.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each):

import pandas as pd
import numpy as np
import warnings
# warnings.simplefilter(action='ignore', category=pd.errors.PerformanceWarning) # no longer needed

df = pd.DataFrame({'a': np.random.randint(0, 100, size=55), 'b': np.random.randint(0, 100, size=55)})

dict_of_cols = {}
# Assign > 100 new columns to the dataframe
for i in range(0, 100):
    dict_of_cols[f'n_{i}'] = np.random.randint(0,100,size=55)
    
df = pd.concat([df, pd.DataFrame(dict_of_cols)], axis=1)

Another example of something that might not be immediately intuitive but makes sense when you think about it. (Obviously for x**2 you could use pandas’ vectorized methods, which would be even faster, but this is just to show the speedup of a plain list comprehension over apply; not all functions you want to use in apply have pandas built-in equivalents. A vectorized sketch is shown after the examples below.)

Before: 17 ms ± 628 µs per loop (mean ± std. dev. of 7 runs, 100 loops each):

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': np.random.randint(0, 100, size=55), 'b': np.random.randint(0, 100, size=55)})

dict_of_cols = {}
# Assign > 100 new columns to the dataframe
for i in range(0, 100):
    dict_of_cols[f'a_{i}'] = df["a"].apply(
        lambda x: x**2
    )
    
df = pd.concat([df, pd.DataFrame(dict_of_cols)], axis=1)

After: 5.99 ms ± 250 µs per loop (mean ± std. dev. of 7 runs, 100 loops each):

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': np.random.randint(0, 100, size=55), 'b': np.random.randint(0, 100, size=55)})

dict_of_cols = {}
# Assign > 100 new columns to the dataframe
for i in range(0, 100):
    dict_of_cols[f'a_{i}'] = [
        x**2 for x in df["a"]
    ]
    
df = pd.concat([df, pd.DataFrame(dict_of_cols)], axis=1)
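
For completeness, the vectorized form mentioned above might look like the following sketch (the a_{i} column names mirror the example; the whole column is squared in NumPy at once instead of element by element):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randint(0, 100, size=55), 'b': np.random.randint(0, 100, size=55)})

# Square the whole column in one vectorized operation per new column.
dict_of_cols = {f'a_{i}': df['a'] ** 2 for i in range(100)}

df = pd.concat([df, pd.DataFrame(dict_of_cols)], axis=1)
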
1 reaction
jbrockmendel commented, Jul 11, 2021

Yes, this was intentional.

the proposed frame.copy() in the error does not do that

This is a bug that should be fixed.

How should the above scenario correctly handle inserts to keep performance and avoid this error?

If the .copy bug is fixed, then you should be fine doing all your inserts and then calling .copy(). A better option would be to use pd.concat to do it all at once.
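
A sketch of the “do all inserts, then copy once” pattern described here, assuming the .copy() consolidation bug is fixed in a later release (the single-step pd.concat alternative is shown in the comment above):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randint(0, 100, size=55), 'b': np.random.randint(0, 100, size=55)})

# Do all of the single-column inserts first...
for i in range(100):
    df[f'n_{i}'] = np.random.randint(0, 100, size=55)

# ...then defragment once at the end. This relies on copy() actually consolidating
# the blocks, i.e. on the bug acknowledged above being fixed.
df = df.copy()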

Read more comments on GitHub >
