Pandas 1.0.1 - .rolling().min() and .rolling().max() create memory leak at <__array_function__ internals>:6
Code Sample, a copy-pastable example if possible
```python
import tracemalloc, linecache
import sys, os
import pandas as pd


def display_top_mem(snapshot, key_type='lineno', limit=10):
    """function for displaying lines of code taking most memory"""
    snapshot = snapshot.filter_traces((
        tracemalloc.Filter(False, "<frozen importlib._bootstrap>"),
        tracemalloc.Filter(False, "<unknown>"),
    ))
    top_stats = snapshot.statistics(key_type)
    print("Top %s lines" % limit)
    for index, stat in enumerate(top_stats[:limit], 1):
        frame = stat.traceback[0]
        # replace "/path/to/module/file.py" with "module/file.py"
        filename = os.sep.join(frame.filename.split(os.sep)[-2:])
        print("#%s: %s:%s: %.1f KiB"
              % (index, filename, frame.lineno, stat.size / 1024))
        line = linecache.getline(frame.filename, frame.lineno).strip()
        if line:
            print('    %s' % line)
    other = top_stats[limit:]
    if other:
        size = sum(stat.size for stat in other)
        print("%s other: %.1f KiB" % (len(other), size / 1024))
    total = sum(stat.size for stat in top_stats)
    print("Total allocated size: %.1f KiB" % (total / 1024))


def main():
    tracemalloc.start()
    periods = 745
    df_init = pd.read_csv('./mem_debug_data.csv', index_col=0)
    for i in range(100):
        df = df_init.copy()
        df['l:c:B'] = df['c:B'].rolling(periods).min()
        df['h:c:B'] = df['c:B'].rolling(periods).max()
        # df['l:c:B'] = df['c:B'].rolling(periods).mean()
        # df['h:c:B'] = df['c:B'].rolling(periods).median()
        snapshot = tracemalloc.take_snapshot()
        display_top_mem(snapshot, limit=3)
        print(f'df size {sys.getsizeof(df)/1024} KiB')
        print(f'{i} ##################')


if __name__ == '__main__':
    main()
```
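For readers without mem_debug_data.csv, the growth can be checked with synthetic data instead. The helper below is not part of the original report: the row count, seed, and leak_growth name are illustrative. It warms up one iteration first (so module-level allocations are excluded), then uses tracemalloc's compare_to to measure net allocation growth across repeated rolling min/max calls; on an affected pandas version this grows with the iteration count.

```python
import tracemalloc
import numpy as np
import pandas as pd


def leak_growth(iterations=20, periods=745, rows=2000):
    """Return net tracemalloc-measured allocation growth (KiB) across
    repeated rolling().min()/max() calls on a synthetic frame."""
    rng = np.random.default_rng(0)
    df_init = pd.DataFrame({'c:B': rng.random(rows)})

    tracemalloc.start()
    # Warm-up iteration so one-time allocations do not count as "growth".
    df = df_init.copy()
    df['l:c:B'] = df['c:B'].rolling(periods).min()
    before = tracemalloc.take_snapshot()

    for _ in range(iterations):
        df = df_init.copy()
        df['l:c:B'] = df['c:B'].rolling(periods).min()
        df['h:c:B'] = df['c:B'].rolling(periods).max()

    after = tracemalloc.take_snapshot()
    stats = after.compare_to(before, 'lineno')
    growth_kib = sum(s.size_diff for s in stats) / 1024
    tracemalloc.stop()
    return growth_kib
```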
Problem description
Pandas rolling().min() and rolling().max() create a memory leak. I ran line-based memory profiling with tracemalloc, and <__array_function__ internals>:6 grows in size on every loop iteration of the script above whenever either of these functions is present. Over 1000 iterations it consumes around 650 MB of RAM, whereas if rolling().min() and rolling().max() are changed to rolling().mean() and rolling().median() and run for 1000 iterations, RAM consumption stays constant at around 4 MB. Therefore rolling().min() and rolling().max() appear to be the problem.
The output of this script for 100 iterations, with <__array_function__ internals>:6 constantly increasing in size, can be found here: https://pastebin.com/nvGKgmPq
The CSV file mem_debug_data.csv used in the script can be found here: http://www.sharecsv.com/s/ad8485d8a0a24a5e12c62957de9b13bd/mem_debug_data.csv
Expected Output
Running rolling().min() and rolling().max() repeatedly over time should not grow RAM consumption.
Output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit           : None
python           : 3.7.6.final.0
python-bits      : 64
OS               : Linux
OS-release       : 4.15.0-88-generic
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_GB.UTF-8
LOCALE           : en_GB.UTF-8

pandas           : 1.0.1
numpy            : 1.18.1
pytz             : 2019.3
dateutil         : 2.8.1
pip              : 19.2.3
setuptools       : 41.2.0
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 2.11.1
IPython          : 7.12.0
pandas_datareader: None
bs4              : None
bottleneck       : None
fastparquet      : None
gcsfs            : None
matplotlib       : 3.1.3
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pytables         : None
pyxlsb           : None
s3fs             : None
scipy            : 1.4.1
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : None
Issue Analytics
- State:
- Created 4 years ago
- Reactions: 5
- Comments: 7 (2 by maintainers)
Top GitHub Comments
fixed by #33693 in 1.0.4 i think
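Since the comment above pins the fix to pandas 1.0.4, a quick runtime guard can decide whether to rely on rolling().min()/max() or fall back to a workaround. The helper name and the crude version parse below are my own illustration, not from the thread:

```python
import re
import pandas as pd


def rolling_minmax_leak_fixed():
    """True if the running pandas is at least 1.0.4, the release the
    comment above says contains the fix from #33693."""
    # Crude parse that tolerates suffixes such as '.dev0' or 'rc1'.
    nums = tuple(int(n) for n in re.findall(r'\d+', pd.__version__)[:3])
    return nums >= (1, 0, 4)
```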
A workaround is to use NumPy with strides-based rolling functions. Alternatively, pandas rolling().apply() with a lambda can be used, but it is very slow. numpy_rolling_min() and numpy_rolling_max() expect NumPy arrays, which can be obtained from a pandas Series via df[column].values.
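The strides-based functions themselves were not captured in the comment above; the following is a reconstruction of what numpy_rolling_min() and numpy_rolling_max() likely looked like, assuming a 1-D float array and pandas-style NaN padding at the start of the result:

```python
import numpy as np


def _rolling_window(a, window):
    # Build a (n - window + 1, window) strided view; no data is copied.
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)


def numpy_rolling_min(a, window):
    # Leading NaNs mimic the alignment of pandas rolling(window).min().
    out = np.full(a.shape, np.nan)
    out[window - 1:] = _rolling_window(a, window).min(axis=-1)
    return out


def numpy_rolling_max(a, window):
    out = np.full(a.shape, np.nan)
    out[window - 1:] = _rolling_window(a, window).max(axis=-1)
    return out
```

Usage against the script above would then be, e.g., df['l:c:B'] = numpy_rolling_min(df['c:B'].values, periods), sidestepping the leaking pandas path entirely.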