Pandas 1.0.1 - .rolling().min() and .rolling().max() create memory leak at <__array_function__ internals>:6
Code Sample, a copy-pastable example if possible
```python
import tracemalloc, linecache
import sys, os
import pandas as pd


def display_top_mem(snapshot, key_type='lineno', limit=10):
    """function for displaying lines of code taking most memory"""
    snapshot = snapshot.filter_traces((
        tracemalloc.Filter(False, "<frozen importlib._bootstrap>"),
        tracemalloc.Filter(False, "<unknown>"),
    ))
    top_stats = snapshot.statistics(key_type)
    print("Top %s lines" % limit)
    for index, stat in enumerate(top_stats[:limit], 1):
        frame = stat.traceback[0]
        # replace "/path/to/module/file.py" with "module/file.py"
        filename = os.sep.join(frame.filename.split(os.sep)[-2:])
        print("#%s: %s:%s: %.1f KiB"
              % (index, filename, frame.lineno, stat.size / 1024))
        line = linecache.getline(frame.filename, frame.lineno).strip()
        if line:
            print('    %s' % line)
    other = top_stats[limit:]
    if other:
        size = sum(stat.size for stat in other)
        print("%s other: %.1f KiB" % (len(other), size / 1024))
    total = sum(stat.size for stat in top_stats)
    print("Total allocated size: %.1f KiB" % (total / 1024))


def main():
    tracemalloc.start()
    periods = 745
    df_init = pd.read_csv('./mem_debug_data.csv', index_col=0)
    for i in range(100):
        df = df_init.copy()
        df['l:c:B'] = df['c:B'].rolling(periods).min()
        df['h:c:B'] = df['c:B'].rolling(periods).max()
        # df['l:c:B'] = df['c:B'].rolling(periods).mean()
        # df['h:c:B'] = df['c:B'].rolling(periods).median()
        snapshot = tracemalloc.take_snapshot()
        display_top_mem(snapshot, limit=3)
        print(f'df size {sys.getsizeof(df)/1024} KiB')
        print(f'{i} ##################')


if __name__ == '__main__':
    main()
```
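For readers without mem_debug_data.csv, the growth can be checked with synthetic data instead. The helper below is not part of the original report: the row count, seed, and leak_growth name are illustrative. It warms up one iteration first (so module-level allocations are excluded), then uses tracemalloc's compare_to to measure net allocation growth across repeated rolling min/max calls; on an affected pandas version this grows with the iteration count.

```python
import tracemalloc
import numpy as np
import pandas as pd


def leak_growth(iterations=20, periods=745, rows=2000):
    """Return net tracemalloc-measured allocation growth (KiB) across
    repeated rolling().min()/max() calls on a synthetic frame."""
    rng = np.random.default_rng(0)
    df_init = pd.DataFrame({'c:B': rng.random(rows)})

    tracemalloc.start()
    # Warm-up iteration so one-time allocations do not count as "growth".
    df = df_init.copy()
    df['l:c:B'] = df['c:B'].rolling(periods).min()
    before = tracemalloc.take_snapshot()

    for _ in range(iterations):
        df = df_init.copy()
        df['l:c:B'] = df['c:B'].rolling(periods).min()
        df['h:c:B'] = df['c:B'].rolling(periods).max()

    after = tracemalloc.take_snapshot()
    stats = after.compare_to(before, 'lineno')
    growth_kib = sum(s.size_diff for s in stats) / 1024
    tracemalloc.stop()
    return growth_kib
```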
Problem description
Pandas rolling().min() and rolling().max() create a memory leak. I ran line-based memory profiling with tracemalloc, and <__array_function__ internals>:6 grows in size on every loop iteration of the script above whenever either of these functions is present. Over 1000 iterations it consumes around 650 MB of RAM, whereas if rolling().min() and rolling().max() are changed to rolling().mean() and rolling().median() and run for 1000 iterations, RAM consumption stays constant at around 4 MB. Therefore rolling().min() and rolling().max() appear to be the problem.
The output of this script for 100 iterations, with <__array_function__ internals>:6 constantly increasing in size, can be found here: https://pastebin.com/nvGKgmPq
The CSV file mem_debug_data.csv used in the script can be found here: http://www.sharecsv.com/s/ad8485d8a0a24a5e12c62957de9b13bd/mem_debug_data.csv
Expected Output
Running rolling().min() and rolling().max() repeatedly over time should not grow RAM consumption.
Output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit           : None
python           : 3.7.6.final.0
python-bits      : 64
OS               : Linux
OS-release       : 4.15.0-88-generic
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_GB.UTF-8
LOCALE           : en_GB.UTF-8

pandas           : 1.0.1
numpy            : 1.18.1
pytz             : 2019.3
dateutil         : 2.8.1
pip              : 19.2.3
setuptools       : 41.2.0
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 2.11.1
IPython          : 7.12.0
pandas_datareader: None
bs4              : None
bottleneck       : None
fastparquet      : None
gcsfs            : None
matplotlib       : 3.1.3
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pytables         : None
pyxlsb           : None
s3fs             : None
scipy            : 1.4.1
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : None
Issue Analytics
- State:
- Created 4 years ago
- Reactions: 5
- Comments: 7 (2 by maintainers)
Top GitHub Comments
fixed by #33693 in 1.0.4 i think
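Since the comment above pins the fix to pandas 1.0.4, a quick runtime guard can decide whether to rely on rolling().min()/max() or fall back to a workaround. The helper name and the crude version parse below are my own illustration, not from the thread:

```python
import re
import pandas as pd


def rolling_minmax_leak_fixed():
    """True if the running pandas is at least 1.0.4, the release the
    comment above says contains the fix from #33693."""
    # Crude parse that tolerates suffixes such as '.dev0' or 'rc1'.
    nums = tuple(int(n) for n in re.findall(r'\d+', pd.__version__)[:3])
    return nums >= (1, 0, 4)
```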
A workaround is to use NumPy with strides-based rolling functions. Alternatively, pandas rolling().apply() with a lambda can be used, but it is very slow. numpy_rolling_min() and numpy_rolling_max() expect NumPy arrays, which can be obtained from a pandas Series via df[column].values.
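The strides-based functions themselves were not captured in the comment above; the following is a reconstruction of what numpy_rolling_min() and numpy_rolling_max() likely looked like, assuming a 1-D float array and pandas-style NaN padding at the start of the result:

```python
import numpy as np


def _rolling_window(a, window):
    # Build a (n - window + 1, window) strided view; no data is copied.
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)


def numpy_rolling_min(a, window):
    # Leading NaNs mimic the alignment of pandas rolling(window).min().
    out = np.full(a.shape, np.nan)
    out[window - 1:] = _rolling_window(a, window).min(axis=-1)
    return out


def numpy_rolling_max(a, window):
    out = np.full(a.shape, np.nan)
    out[window - 1:] = _rolling_window(a, window).max(axis=-1)
    return out
```

Usage against the script above would then be, e.g., df['l:c:B'] = numpy_rolling_min(df['c:B'].values, periods), sidestepping the leaking pandas path entirely.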