Memory leak / increasing usage in Jupyter for repeated cell execution
I believe that with HoloViews 1.8.3 on Jupyter there is a non-trivial memory leak when repeatedly executing a cell. This is an inconvenience when working on and refining large data sets iteratively.
I’m reporting this issue based on a real-world, admittedly somewhat complex use case, and I’ll admit that I’m not sure I’m using HoloViews correctly. As I repeatedly execute cells in a Jupyter notebook, the memory usage of the kernel grows without bound. The issue exists whether or not I’m using datashading and large datasets, but since the memory increase is proportional to the data size it’s much more noticeable/problematic with a lot of data, so I’ll focus on that case here.
I’m combining several techniques in order to create a rich, user-friendly interface for reviewing my data. (Kudos to HoloViews for being able to do this at all!) The techniques are:
- datashading Scatters of large data sets
- creating Layouts of the scatters, where each scatter in the Layout uses a different x dimension
- creating a HoloMap interface so the user can cycle/explore subsets of the data iteratively
- using `redim` and `{+framewise}` to ensure that all displays are scaled properly (see the minimal sketch right after this list)
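For the last two points, here is a minimal toy sketch of the `redim.range` padding technique on a small `HoloMap`, kept separate from my real code (the data and names here are hypothetical; my actual repro is further down, and `{+framewise}` is applied via the `%%opts` magic as in the evaluation cell below):

```python
import numpy as np
import holoviews as hv
hv.extension('bokeh')

def padded(scatter, frac=1.05):
    # Widen both ranges slightly so points do not sit on the plot border.
    def pad(v):
        m, l = (v.min() + v.max()) / 2, (v.max() - v.min()) / 2
        return (m - frac * l, m + frac * l)
    xs, ys = scatter.dimension_values(0), scatter.dimension_values(1)
    return scatter.redim.range(x=pad(xs), y=pad(ys))

# Toy HoloMap over a "scale" key; each frame carries its own padded ranges,
# and {+framewise} then rescales each frame to those ranges when displayed.
hmap = hv.HoloMap({s: padded(hv.Scatter(np.random.randn(100, 2) * s))
                   for s in (1, 5, 10)}, kdims=['scale'])
hmap
```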
I’ve supplied some code below that is a non-proprietary repro of my use case. It shows the same pattern of increasing kernel memory with each cell invocation. Again, I wrote it through trial and error, and I am by no means sure that I’m not abusing something and/or that there is a better way to accomplish the same things with HoloViews.
**Initialization cell**
import pandas as pd
import numpy as np
import holoviews as hv
from holoviews.operation.datashader import aggregate, shade, datashade, dynspread
import sys
hv.extension('bokeh')
n, k = 1_000_000, 4
scales = np.linspace(1, 10, k)
df = pd.concat([s * pd.DataFrame({
    'x1': np.random.randn(n),
    'x2': np.abs(np.random.randn(n)),
    'x3': np.random.chisquare(1, n),
    'x4': np.random.uniform(0, s, n),
    'y': np.random.randn(n),
    's': np.full(n, 1)
}) for s in scales])

def extend_range(p, frac):
    a, b = np.min(p), np.max(p)
    m, l = (a + b) / 2, (b - a) / 2
    return (m - frac * l, m + frac * l)

def pad_scatter(s: hv.Scatter, frac=1.05):
    df = s.dframe()
    r = {d.name: extend_range(df[d.name], frac) for d in (s.kdims + s.vdims)[0:2]}
    return s.redim.range(**r)
print(f'df is around {sys.getsizeof(df) // 1024_000} MB')
Running this cell, I get
df is around 218 MB
and my Jupyter kernel’s memory usage is around 1831 MB.
**Evaluation cell**
%%opts RGB {+framewise}
hv.Layout([dynspread(datashade(hv.HoloMap([(s, pad_scatter(hv.Scatter(df[df.s == s], kdims=[x, 'y'])))
                                            for s in scales])))
           for x in ['x1', 'x2', 'x3', 'x4']]).cols(2)
This gives me a very nice, properly scaled layout of shaded scatters.
However, the kernel’s memory usage grows each time I re-evaluate that cell: 2717 MB, 3455 MB, 4441 MB, 5307 MB, and so on.
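Those numbers come from watching the kernel process externally; for anyone reproducing this, a minimal way to read the kernel’s resident memory from inside the notebook (assuming `psutil` is installed) would be something like:

```python
# Small helper to print the kernel's resident memory (RSS); assumes psutil is
# installed. Run it after each evaluation of the cell above to see the growth.
import os
import psutil

def kernel_mb():
    rss = psutil.Process(os.getpid()).memory_info().rss
    return rss // (1024 * 1024)

print(f'kernel RSS is around {kernel_mb()} MB')
```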
In reality I’m working with much more data (dataframes of around 10-30 GB), and even though I’m on a pretty beefy machine, this becomes a fairly big problem as I poke around and do trial-and-error exploration; I find myself having to restart the kernel pretty often.
I’m not using dask (maybe I should be), but I’m not sure that would fix the issue.
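For reference, a rough sketch of what I would try if I did switch to dask (assuming `dask.dataframe` and the HoloViews dask data interface are available); I have not verified that it changes the memory behaviour:

```python
# Hedged sketch only: wrap the pandas frame in a dask DataFrame and use it in
# place of df in the evaluation cell above. Assumes dask[dataframe] is
# installed; not verified to avoid the growth in kernel memory.
import dask.dataframe as dd

ddf = dd.from_pandas(df, npartitions=8)
# e.g. hv.Scatter(ddf[ddf.s == s], kdims=[x, 'y']) instead of the pandas frame
```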
This issue does not appear to be specific to datashader or the large size of the data: if I run something similar with a much smaller `n`, using only a `HoloMap` and no datashading, I see a similar increase in memory, just with an obviously much smaller slope because `n` is smaller.
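Concretely, the smaller variant I mean is along these lines (a sketch only; the subset size is hypothetical, and it reuses `df`, `scales` and `pad_scatter` from the initialization cell):

```python
%%opts Scatter {+framewise}
# Datashader-free variant: a plain HoloMap of padded Scatters over a much
# smaller subset. Repeated evaluation of this cell also grows kernel memory,
# just with a much smaller slope.
small_n = 10_000  # hypothetical; any small n shows the same pattern
small_df = df.groupby('s').head(small_n)
hv.Layout([hv.HoloMap([(s, pad_scatter(hv.Scatter(small_df[small_df.s == s], kdims=[x, 'y'])))
                       for s in scales])
           for x in ['x1', 'x2', 'x3', 'x4']]).cols(2)
```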
Top GitHub Comments
One way to check whether it’s the Jupyter caching that’s causing this is to reset the `out` variable using `%reset out`, which should delete all references to the output.

I have a similar problem with tf2, pandas and matplotlib, and `%reset out` actually helped when `gc.collect()` did not. Thx, @philippjfr. Links for understanding how it works:
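For reference, a rough sketch of applying that `%reset out` workaround from a notebook cell (assuming a standard IPython/Jupyter kernel):

```python
# Rough sketch of the workaround described above. %reset out clears the
# Out[...] output cache, which otherwise keeps a reference to every displayed
# object; gc.collect() then asks the garbage collector to reclaim them.
import gc
from IPython import get_ipython

get_ipython().run_line_magic('reset', '-f out')  # -f skips the confirmation prompt
gc.collect()
```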