Leaks memory when input is not a numpy array
If you run the following program you can see that nansum leaks all the memory it is given when passed a Pandas object. If it is passed the ndarray underlying the Pandas object instead, there is no leak:
import gc
import os

import bottleneck
import numpy as np
import pandas as pd
import psutil

def f():
    x = np.zeros(10 * 1024 * 1024, dtype='f4')
    # Leaks 40MB/iteration
    bottleneck.nansum(pd.Series(x))
    # No leak:
    # bottleneck.nansum(x)

process = psutil.Process(os.getpid())

def _get_usage():
    gc.collect()
    # 'private' is the Windows-specific field of memory_info(); use 'rss' on other platforms
    return process.memory_info().private / (1024 * 1024)

last_usage = _get_usage()
print(last_usage)
for _ in range(10):
    f()
    usage = _get_usage()
    print(usage - last_usage)
    last_usage = usage
This affects not just nansum, but apparently all the reduction functions (with or without axis specified), and at least some other functions like move_max.
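For example, a minimal variation of the repro above (assuming the same pattern holds; the window size is arbitrary) shows the same symptom with a moving-window function:

import bottleneck
import numpy as np
import pandas as pd

x = np.zeros(10 * 1024 * 1024, dtype='f4')
# Per the observation above: memory grows on each call when a Series is passed ...
bottleneck.move_max(pd.Series(x), window=5)
# ... but not when the underlying ndarray is passed directly.
bottleneck.move_max(x, window=5)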
I’m not completely sure why this happens, but maybe it’s because PyArray_FROM_O is allocating a new array in this case, and the ref count of that is not being decremented by anyone? https://github.com/kwgoodman/bottleneck/blob/master/bottleneck/src/reduce_template.c#L1237
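A rough Python-level illustration of that hypothesis (the real conversion happens in C, so this is only an analogy): an ndarray input converts to itself, whereas a Series input yields a separate ndarray object, which the C caller would then own and need to release:

import numpy as np
import pandas as pd

x = np.zeros(10, dtype='f4')
s = pd.Series(x)

print(np.asarray(x) is x)   # True: already an ndarray, nothing new is created
print(np.asarray(s) is s)   # False: a distinct ndarray object is produced for the Series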
I’m using Bottleneck 1.2.1 with Pandas 0.23.1. sys.version is 3.6.1 (v3.6.1:69c0db5, Mar 21 2017, 18:41:36) [MSC v.1900 64 bit (AMD64)].
OK, I merged the memory leak fix into master.
Sorry, it’s actually fine… but in case someone stumbles over this again, I’ll add this here: I underestimated how much memory np.sum and np.nansum use temporarily. Here is a profile of both sum operations, using either only numpy arrays or a mix of one array and one h5py.Dataset, as in np.sum([arr, dset]). A single array/dataset is 256 MB, and we always create/operate on two of those:
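As a hedged sketch of the kind of mixed sum being described (the file name below is a placeholder, not the commenter’s actual setup): np.sum over a list containing an ndarray and an h5py.Dataset presumably first materializes the list as one temporary (2, N) array, so both 256 MB inputs are held in memory at once on top of the original array:

import h5py
import numpy as np

n = 64 * 1024 * 1024                       # 64M float32 values ~= 256 MB each
arr = np.zeros(n, dtype='f4')

with h5py.File('example.h5', 'w') as f:    # placeholder file name
    dset = f.create_dataset('x', data=arr)
    # np.sum first converts the list to a single (2, n) float32 array,
    # reading the whole Dataset and copying arr -- roughly 512 MB of
    # temporary memory on top of arr itself.
    total = np.sum([arr, dset])
    print(total)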