Leaks memory when input is not a numpy array
If you run the following program you can see that nansum leaks all the memory it is given when passed a Pandas object. If it is passed the ndarray underlying the Pandas object instead, there is no leak:
import gc
import os

import bottleneck
import numpy as np
import pandas as pd
import psutil

def f():
    x = np.zeros(10 * 1024 * 1024, dtype='f4')
    # Leaks 40MB/iteration
    bottleneck.nansum(pd.Series(x))
    # No leak:
    # bottleneck.nansum(x)

process = psutil.Process(os.getpid())

def _get_usage():
    gc.collect()
    # 'private' is the Windows-specific field of memory_info(); use 'rss' on other platforms
    return process.memory_info().private / (1024 * 1024)

last_usage = _get_usage()
print(last_usage)
for _ in range(10):
    f()
    usage = _get_usage()
    print(usage - last_usage)
    last_usage = usage
This affects not just nansum, but apparently all the reduction functions (with or without axis specified), and at least some other functions like move_max.
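For example, a minimal variation of the repro above (assuming the same pattern holds; the window size is arbitrary) shows the same symptom with a moving-window function:

import bottleneck
import numpy as np
import pandas as pd

x = np.zeros(10 * 1024 * 1024, dtype='f4')
# Per the observation above: memory grows on each call when a Series is passed ...
bottleneck.move_max(pd.Series(x), window=5)
# ... but not when the underlying ndarray is passed directly.
bottleneck.move_max(x, window=5)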
I’m not completely sure why this happens, but maybe it’s because PyArray_FROM_O is allocating a new array in this case, and the ref count of that is not being decremented by anyone? https://github.com/kwgoodman/bottleneck/blob/master/bottleneck/src/reduce_template.c#L1237
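A rough Python-level illustration of that hypothesis (the real conversion happens in C, so this is only an analogy): an ndarray input converts to itself, whereas a Series input yields a separate ndarray object, which the C caller would then own and need to release:

import numpy as np
import pandas as pd

x = np.zeros(10, dtype='f4')
s = pd.Series(x)

print(np.asarray(x) is x)   # True: already an ndarray, nothing new is created
print(np.asarray(s) is s)   # False: a distinct ndarray object is produced for the Series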
I’m using Bottleneck 1.2.1 with Pandas 0.23.1. sys.version is 3.6.1 (v3.6.1:69c0db5, Mar 21 2017, 18:41:36) [MSC v.1900 64 bit (AMD64)].
OK, I merged the memory leak fix into master.
Sorry, it’s actually fine… but in case someone stumbles over this again, I’ll add this here: I underestimated how much memory np.sum and np.nansum use temporarily. Here is a profile of both sum operations, using either only numpy arrays or a mix of one array and one h5py.Dataset, as in np.sum([arr, dset]). A single array/dataset is 256 MB, and we always create/operate on two of those:
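As a hedged sketch of the kind of mixed sum being described (the file name below is a placeholder, not the commenter’s actual setup): np.sum over a list containing an ndarray and an h5py.Dataset presumably first materializes the list as one temporary (2, N) array, so both 256 MB inputs are held in memory at once on top of the original array:

import h5py
import numpy as np

n = 64 * 1024 * 1024                       # 64M float32 values ~= 256 MB each
arr = np.zeros(n, dtype='f4')

with h5py.File('example.h5', 'w') as f:    # placeholder file name
    dset = f.create_dataset('x', data=arr)
    # np.sum first converts the list to a single (2, n) float32 array,
    # reading the whole Dataset and copying arr -- roughly 512 MB of
    # temporary memory on top of arr itself.
    total = np.sum([arr, dset])
    print(total)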