Pandas 1.0.1 - .rolling().min() and .rolling().max() create memory leak at <__array_function__ internals>:6

Code Sample, a copy-pastable example if possible

import tracemalloc, linecache
import sys, os
import pandas as pd

def display_top_mem(snapshot, key_type='lineno', limit=10):
    """function for displaying lines of code taking most memory"""
    snapshot = snapshot.filter_traces((
        tracemalloc.Filter(False, "<frozen importlib._bootstrap>"),
        tracemalloc.Filter(False, "<unknown>"),
    ))
    top_stats = snapshot.statistics(key_type)

    print("Top %s lines" % limit)
    for index, stat in enumerate(top_stats[:limit], 1):
        frame = stat.traceback[0]
        # replace "/path/to/module/file.py" with "module/file.py"
        filename = os.sep.join(frame.filename.split(os.sep)[-2:])
        print("#%s: %s:%s: %.1f KiB"
              % (index, filename, frame.lineno, stat.size / 1024))
        line = linecache.getline(frame.filename, frame.lineno).strip()
        if line:
            print('    %s' % line)

    other = top_stats[limit:]
    if other:
        size = sum(stat.size for stat in other)
        print("%s other: %.1f KiB" % (len(other), size / 1024))
    total = sum(stat.size for stat in top_stats)
    print("Total allocated size: %.1f KiB" % (total / 1024))


def main():
    tracemalloc.start()
    periods = 745
    df_init = pd.read_csv('./mem_debug_data.csv', index_col=0)

    for i in range(100):
        df = df_init.copy()

        df['l:c:B'] = df['c:B'].rolling(periods).min()
        df['h:c:B'] = df['c:B'].rolling(periods).max()

        #df['l:c:B'] = df['c:B'].rolling(periods).mean()
        #df['h:c:B'] = df['c:B'].rolling(periods).median()

        snapshot = tracemalloc.take_snapshot()
        display_top_mem(snapshot, limit=3)
        print(f'df size {sys.getsizeof(df)/1024} KiB')
        print(f'{i} ##################')


if __name__ == '__main__':
    main()

Problem description

The pandas rolling().min() and rolling().max() functions appear to leak memory. I ran line-based memory profiling with tracemalloc, and the allocation attributed to <__array_function__ internals>:6 grows on every loop iteration of the script above when both of these functions are used. Over 1000 iterations the script consumes around 650 MB of RAM. If rolling().min() and rolling().max() are replaced with rolling().mean() and rolling().median() and the script is run for 1000 iterations, RAM consumption stays constant at around 4 MB. rolling().min() and rolling().max() therefore seem to be the problem.

The output of this script running for 100 iterations with <__array_function__ internals>:6 constantly increasing in size can be found here: https://pastebin.com/nvGKgmPq

CSV file mem_debug_data.csv used in the script can be found here: http://www.sharecsv.com/s/ad8485d8a0a24a5e12c62957de9b13bd/mem_debug_data.csv
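
For readers who cannot download the CSV, here is a minimal sketch (not part of the original report) for checking the growth on synthetic data. `traced_growth` and `rolling_min_max_once` are hypothetical helper names, and the random 20,000-row column is an assumption standing in for mem_debug_data.csv.

```python
import gc
import tracemalloc


def traced_growth(func, iterations=50, warmup=3):
    """Return the change in traced memory (bytes) across `iterations` calls of func."""
    for _ in range(warmup):
        func()  # let caches and lazy imports settle before measuring
    gc.collect()
    started_here = not tracemalloc.is_tracing()
    if started_here:
        tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    for _ in range(iterations):
        func()
    gc.collect()
    after, _ = tracemalloc.get_traced_memory()
    if started_here:
        tracemalloc.stop()
    return after - before


def rolling_min_max_once(n_rows=20_000, periods=745):
    """One loop body from the report, on synthetic data (assumption: no CSV)."""
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"c:B": np.random.default_rng(0).random(n_rows)})
    df["l:c:B"] = df["c:B"].rolling(periods).min()
    df["h:c:B"] = df["c:B"].rolling(periods).max()
```

Calling `traced_growth(rolling_min_max_once, iterations=100)` on an affected installation would be expected to report growth that scales with the iteration count, while a value near zero suggests the leak is absent. Swapping mean() and median() into the loop body gives the constant baseline described above.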

Expected Output

Calling rolling().min() and rolling().max() repeatedly should not increase RAM consumption over time.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-88-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.0.1
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 19.2.3
setuptools : 41.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.12.0
pandas_datareader : None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Reactions: 5
  • Comments: 7 (2 by maintainers)

Top GitHub Comments

2 reactions
jreback commented, Jul 10, 2020

Fixed by #33693 in 1.0.4, I think.

2 reactions
regmeg commented, Feb 27, 2020

A workaround is to use NumPy with the following stride-based functions. Pandas apply with a lambda can also be used on top of rolling, but it is very slow.

import numpy as np

def rolling_window_nan_filled(a_org, window):
    # Left-pad with window-1 NaNs so there is one window per input element.
    a = np.concatenate((np.full(window - 1, np.nan), a_org))
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    # Strided view of a: each row is one window, and no data is copied.
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

def numpy_rolling_min(values, periods):
    # Windows that include NaN padding yield NaN, like pandas' leading NaNs.
    return np.min(rolling_window_nan_filled(values, periods), axis=1)

def numpy_rolling_max(values, periods):
    return np.max(rolling_window_nan_filled(values, periods), axis=1)

numpy_rolling_min() and numpy_rolling_max() expect a NumPy array, which can be obtained from a pandas Series with df[column].values.
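
As a hedged alternative sketch (not from the original comment), the same windowing can be written with `numpy.lib.stride_tricks.sliding_window_view`. It requires NumPy 1.20 or newer, which is later than the numpy 1.18.1 in the report. It returns a read-only view and checks bounds, which `as_strided` does not. The names `rolling_reduce`, `rolling_min` and `rolling_max` are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def rolling_reduce(values, window, reducer):
    """Reduce each trailing window; the first window-1 results are NaN."""
    values = np.asarray(values, dtype=float)
    out = np.full(values.shape[0], np.nan)
    if window <= values.shape[0]:
        # Shape (n - window + 1, window); a read-only view, no copy is made.
        windows = sliding_window_view(values, window)
        out[window - 1:] = reducer(windows, axis=1)
    return out


def rolling_min(values, window):
    return rolling_reduce(values, window, np.min)


def rolling_max(values, window):
    return rolling_reduce(values, window, np.max)
```

With the column from the report, this could be called as `rolling_min(df['c:B'].values, periods)`. As with the workaround above, a NaN anywhere inside a window makes that window's result NaN.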
