Memory leak in numpy.rec.fromarrays
Hey, so I was having a serious memory issue with dedupe that made it impossible to use, even though the scale of the problem wasn't that big. I was using record linkage mode to link two datasets of about 150k records each on my MacBook Pro with 16 GB of RAM. The program was on track to finish in about 4 hours, but memory usage slowly climbed well past 16 GB as comparisons were made and scored, and my machine crashed about 70% of the way in. After a lot of digging, I traced it down to this line in dedupe.core (line 154):

scored_pairs = numpy.rec.fromarrays((ids, scores),
                                    dtype=[('pairs', 'object', 2),
                                           ('score', 'f4', 1)])
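For reference, here is a minimal, self-contained sketch of the leaking pattern; the data is made up, but the call and dtype match the line above. Watching peak RSS across iterations makes the growth visible (on the numpy 1.10.2 setup reported here it kept climbing; with no leak it should plateau):

import resource

import numpy

ids = [(str(i), str(i + 1)) for i in range(1000)]
scores = numpy.random.random(1000).astype('f4')

for step in range(10001):
    # The same per-iteration construction dedupe.core performs per chunk
    scored_pairs = numpy.rec.fromarrays((ids, scores),
                                        dtype=[('pairs', 'object', 2),
                                               ('score', 'f4', 1)])
    if step % 1000 == 0:
        # ru_maxrss is reported in KB on Linux and bytes on macOS
        print(step, resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)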
Commenting that fromarrays call out (i.e., creating the comparisons and scoring them, but not storing the results) gives me stable memory usage of around 3-4 GB (using one core; otherwise that number would be multiplied). I know the growth isn't from simply storing the results that pass the threshold, as that should be maybe 1 GB at most. It is a memory leak in the numpy function, which is not releasing memory as this array is created in the loop. Has anyone else noticed this? I am using numpy 1.10.2 and dedupe 1.2.2. With this leak, dedupe is unable to process this modest dataset.

To get around it, I changed the code to store the results that pass the threshold in standard arrays, and to put all of the results into the numpy record array only after the loop finishes, to stay consistent with the rest of dedupe. This fixed the memory issue for me. I can now process the whole dataset in about 3-4 hours, with memory stable around 4 GB, only very slightly increasing over time (to maybe 4.5 GB total at the end) from storing the above-threshold comparisons.

This is the code I used to test and get around the leak. I wrote a new function, scoreDuplicates2, which has the same input and output as scoreDuplicates but doesn't use any multiprocessing queues, so I had greater transparency. You might also notice a hacky zip function at the bottom and a seemingly dumb list conversion; I did that because for some reason zip was actually resolving to itertools.izip, and the fromarrays call was finicky about the exact format the lists were given in. Also, I am using numpy.core.records.fromarrays rather than numpy.rec.fromarrays, as it actually has a little bit of documentation, and I am reasonably sure they call the same function.
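The function below leans on dedupe's peek helper; here is a minimal sketch of its behavior in case you want to run this standalone (the real implementation lives in dedupe.core):

import itertools

def peek(seq):
    # Return (first element, equivalent iterator), or (None, iterator)
    # when the input is already exhausted.
    seq = iter(seq)
    try:
        first = next(seq)
    except StopIteration:
        return None, seq
    return first, itertools.chain([first], seq)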
import itertools

import numpy


def scoreDuplicates2(record_pairs, data_model, classifier,
                     num_cores=1, threshold=0.5):
    # num_cores is kept for signature parity with scoreDuplicates;
    # this version does no multiprocessing.
    final_ids1 = numpy.array([])
    final_ids2 = numpy.array([])
    final_scores = numpy.array([])
    chunk_size = 100000
    total = 0

    while True:
        ids1 = []
        ids2 = []
        records = []

        chunk = itertools.islice(record_pairs, chunk_size)
        first, chunk = peek(chunk)  # renamed from `next` to avoid shadowing the builtin
        if first is not None:
            for record_pair in chunk:
                ((id_1, record_1, smaller_ids_1),
                 (id_2, record_2, smaller_ids_2)) = record_pair

                if set.isdisjoint(smaller_ids_1, smaller_ids_2):
                    ids1.append(id_1)
                    ids2.append(id_2)
                    records.append((record_1, record_2))

            if records:
                distances = data_model.distances(records)
                scores = classifier.predict_proba(distances)[:, -1]

                # Keep only the results that clear the threshold, in plain
                # numpy arrays, instead of building a record array per chunk.
                mask = scores > threshold
                final_ids1 = numpy.append(final_ids1, numpy.array(ids1)[mask])
                final_ids2 = numpy.append(final_ids2, numpy.array(ids2)[mask])
                final_scores = numpy.append(final_scores, scores[mask])

                # The per-chunk construction that leaked:
                # scored_pairs = numpy.core.records.fromarrays(
                #     (ids, scores),
                #     dtype=[('pairs', 'object', 2), ('score', 'f4', 1)])

            total += chunk_size
        else:
            # Input exhausted: build the record array exactly once.
            final_ids = []
            for i in range(len(final_ids1)):
                final_ids.append((final_ids1[i], final_ids2[i]))
            # Seemingly dumb list conversion; fromarrays was finicky
            # about the exact input format (see above).
            final_scores = list(final_scores)

            final_pairs = numpy.core.records.fromarrays(
                (final_ids, final_scores),
                dtype=[('pairs', 'object', 2), ('score', 'f4', 1)])
            return final_pairs
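For completeness, this is roughly how I exercised the replacement. candidate_pairs and linker are hypothetical here: candidate_pairs stands in for the iterator of record-pair tuples dedupe's blocking produces, and the .data_model / .classifier attribute names are my assumption about the dedupe 1.2.x internals rather than anything documented:

# Hypothetical call site; linker is a trained RecordLink instance.
scored = scoreDuplicates2(candidate_pairs,
                          linker.data_model,
                          linker.classifier,
                          threshold=0.5)

for (id_1, id_2), score in zip(scored['pairs'], scored['score']):
    print(id_1, id_2, score)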
Top GitHub Comments
Yes I could look at that
On Tue, May 3, 2016 at 2:15 PM, Forest Gregg notifications@github.com wrote:
@lucaswiser would you mind checking if this last commit fixes the memory leak you were seeing?