Memory leak / increasing usage in Jupyter for repeated cell execution
I believe that with HoloViews 1.8.3 on Jupyter there is a non-trivial memory leak when repeatedly executing a cell. This is an inconvenience when working on and refining large data sets iteratively.
I’m reporting this issue based on a real-world, admittedly somewhat complex use case, and I’ll admit that I’m not sure I’m using HoloViews correctly. As I repeatedly execute cells in a Jupyter notebook, the memory usage of the kernel grows without bound. The issue exists whether or not I’m using datashading and large datasets, but since the memory increase is proportional to the data size it’s much more noticeable/problematic with a lot of data, so I’ll focus on that case here.
I’m combining several techniques in order to create a rich, user-friendly interface for reviewing my data. (Kudos to HoloViews for being able to do this at all!) The techniques are:
- datashading Scatters of large data sets
- creating Layouts of the scatters, where each scatter in the Layout uses a different x dimension
- creating a HoloMap interface so the user can cycle/explore subsets of the data iteratively
- using `redim` and `{+framewise}` to ensure that all displays are scaled properly (see the minimal sketch right after this list)
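For the last two points, here is a minimal toy sketch of the `redim.range` padding technique on a small `HoloMap`, kept separate from my real code (the data and names here are hypothetical; my actual repro is further down, and `{+framewise}` is applied via the `%%opts` magic as in the evaluation cell below):

```python
import numpy as np
import holoviews as hv
hv.extension('bokeh')

def padded(scatter, frac=1.05):
    # Widen both ranges slightly so points do not sit on the plot border.
    def pad(v):
        m, l = (v.min() + v.max()) / 2, (v.max() - v.min()) / 2
        return (m - frac * l, m + frac * l)
    xs, ys = scatter.dimension_values(0), scatter.dimension_values(1)
    return scatter.redim.range(x=pad(xs), y=pad(ys))

# Toy HoloMap over a "scale" key; each frame carries its own padded ranges,
# and {+framewise} then rescales each frame to those ranges when displayed.
hmap = hv.HoloMap({s: padded(hv.Scatter(np.random.randn(100, 2) * s))
                   for s in (1, 5, 10)}, kdims=['scale'])
hmap
```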
I’ve supplied some code below that is a non-proprietary repro of my use case. It shows the same pattern of increasing kernel memory with each cell invocation. Again, I wrote it through trial and error, and I am by no means sure that I’m not abusing something and/or that there is a better way to accomplish the same things with HoloViews.
**Initialization cell**
import pandas as pd
import numpy as np
import holoviews as hv
from holoviews.operation.datashader import aggregate, shade, datashade, dynspread
import sys
hv.extension('bokeh')
n, k = 1_000_000, 4
scales = np.linspace(1, 10, k)
df = pd.concat([s * pd.DataFrame({
    'x1': np.random.randn(n),
    'x2': np.abs(np.random.randn(n)),
    'x3': np.random.chisquare(1, n),
    'x4': np.random.uniform(0, s, n),
    'y': np.random.randn(n),
    's': np.full(n, 1)
}) for s in scales])

def extend_range(p, frac):
    a, b = np.min(p), np.max(p)
    m, l = (a + b) / 2, (b - a) / 2
    return (m - frac * l, m + frac * l)

def pad_scatter(s: hv.Scatter, frac=1.05):
    df = s.dframe()
    r = {d.name: extend_range(df[d.name], frac) for d in (s.kdims + s.vdims)[0:2]}
    return s.redim.range(**r)
print(f'df is around {sys.getsizeof(df) // 1024_000} MB')
Running this cell, I get
df is around 218 MB
and my Jupyter kernel’s memory usage is around 1831 MB.
**Evaluation cell**
%%opts RGB {+framewise}
hv.Layout([dynspread(datashade(hv.HoloMap([(s, pad_scatter(hv.Scatter(df[df.s == s], kdims=[x, 'y'])))
                                            for s in scales])))
           for x in ['x1', 'x2', 'x3', 'x4']]).cols(2)
This gives me a very nice, properly scaled layout of shaded scatters.
However, the kernel’s memory usage grows each time I re-evaluate that cell: 2717 MB, 3455 MB, 4441 MB, 5307 MB, and so on.
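Those numbers come from watching the kernel process externally; for anyone reproducing this, a minimal way to read the kernel’s resident memory from inside the notebook (assuming `psutil` is installed) would be something like:

```python
# Small helper to print the kernel's resident memory (RSS); assumes psutil is
# installed. Run it after each evaluation of the cell above to see the growth.
import os
import psutil

def kernel_mb():
    rss = psutil.Process(os.getpid()).memory_info().rss
    return rss // (1024 * 1024)

print(f'kernel RSS is around {kernel_mb()} MB')
```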
In reality I’m working with much more data (dataframes of around 10-30 GB), and even though I’m on a pretty beefy machine, this becomes a fairly big problem as I poke around and do trial-and-error exploration; I find myself having to restart the kernel pretty often.
I’m not using dask (maybe I should be), but I’m not sure that would fix the issue.
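For reference, a rough sketch of what I would try if I did switch to dask (assuming `dask.dataframe` and the HoloViews dask data interface are available); I have not verified that it changes the memory behaviour:

```python
# Hedged sketch only: wrap the pandas frame in a dask DataFrame and use it in
# place of df in the evaluation cell above. Assumes dask[dataframe] is
# installed; not verified to avoid the growth in kernel memory.
import dask.dataframe as dd

ddf = dd.from_pandas(df, npartitions=8)
# e.g. hv.Scatter(ddf[ddf.s == s], kdims=[x, 'y']) instead of the pandas frame
```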
This issue does not appear to be specific to datashader or the large size of the data: if I run something similar with a much smaller `n`, using only a `HoloMap` and no datashading, I see a similar increase in memory, just with an obviously much smaller slope because `n` is smaller.
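Concretely, the smaller variant I mean is along these lines (a sketch only; the subset size is hypothetical, and it reuses `df`, `scales` and `pad_scatter` from the initialization cell):

```python
%%opts Scatter {+framewise}
# Datashader-free variant: a plain HoloMap of padded Scatters over a much
# smaller subset. Repeated evaluation of this cell also grows kernel memory,
# just with a much smaller slope.
small_n = 10_000  # hypothetical; any small n shows the same pattern
small_df = df.groupby('s').head(small_n)
hv.Layout([hv.HoloMap([(s, pad_scatter(hv.Scatter(small_df[small_df.s == s], kdims=[x, 'y'])))
                       for s in scales])
           for x in ['x1', 'x2', 'x3', 'x4']]).cols(2)
```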
Top GitHub Comments
One way to check whether it’s the Jupyter caching that’s causing this is to reset the `out` variable using `%reset out`, which should delete all references to the output.

I have a similar problem with tf2, pandas and matplotlib, and `%reset out` actually helped when `gc.collect()` did not. Thx, @philippjfr. Links for understanding how it works:
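For reference, a rough sketch of applying that `%reset out` workaround from a notebook cell (assuming a standard IPython/Jupyter kernel):

```python
# Rough sketch of the workaround described above. %reset out clears the
# Out[...] output cache, which otherwise keeps a reference to every displayed
# object; gc.collect() then asks the garbage collector to reclaim them.
import gc
from IPython import get_ipython

get_ipython().run_line_magic('reset', '-f out')  # -f skips the confirmation prompt
gc.collect()
```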