Memory leak in numpy.rec.fromarrays
Hey, so I was having a serious memory issue with dedupe that made it impossible to use, even though the scale of the problem wasn't that big. I was using record linkage mode to link two datasets of about 150k records each on my MacBook Pro with 16 GB of RAM. The program was on track to finish in about 4 hours, but memory usage slowly climbed well past 16 GB as comparisons were made and scored, and my machine crashed about 70% of the way in. After a lot of digging, I traced it down to this line in dedupe.core (line 154):

scored_pairs = numpy.rec.fromarrays((ids, scores),
                                    dtype=[('pairs', 'object', 2),
                                           ('score', 'f4', 1)])
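For reference, here is a minimal, self-contained sketch of the leaking pattern; the data is made up, but the call and dtype match the line above. Watching peak RSS across iterations makes the growth visible (on the numpy 1.10.2 setup reported here it kept climbing; with no leak it should plateau):

import resource

import numpy

ids = [(str(i), str(i + 1)) for i in range(1000)]
scores = numpy.random.random(1000).astype('f4')

for step in range(10001):
    # The same per-iteration construction dedupe.core performs per chunk
    scored_pairs = numpy.rec.fromarrays((ids, scores),
                                        dtype=[('pairs', 'object', 2),
                                               ('score', 'f4', 1)])
    if step % 1000 == 0:
        # ru_maxrss is reported in KB on Linux and bytes on macOS
        print(step, resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)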
Commenting that fromarrays call out (i.e., creating the comparisons and scoring them, but not storing the results) gives me stable memory usage of around 3-4 GB (using one core; otherwise that number would be multiplied). I know the growth isn't from simply storing the results that pass the threshold, as that should be maybe 1 GB at most. It is a memory leak in the numpy function, which is not releasing memory as this array is created in the loop. Has anyone else noticed this? I am using numpy 1.10.2 and dedupe 1.2.2. With this leak, dedupe is unable to process this modest dataset.

To get around it, I changed the code to store the results that pass the threshold in standard arrays, and to put all of the results into the numpy record array only after the loop finishes, to stay consistent with the rest of dedupe. This fixed the memory issue for me. I can now process the whole dataset in about 3-4 hours, with memory stable around 4 GB, only very slightly increasing over time (to maybe 4.5 GB total at the end) from storing the above-threshold comparisons.

This is the code I used to test and get around the leak. I wrote a new function, scoreDuplicates2, which has the same input and output as scoreDuplicates but doesn't use any multiprocessing queues, so I had greater transparency. You might also notice a hacky zip function at the bottom and a seemingly dumb list conversion; I did that because for some reason zip was actually resolving to itertools.izip, and the fromarrays call was finicky about the exact format the lists were given in. Also, I am using numpy.core.records.fromarrays rather than numpy.rec.fromarrays, as it actually has a little bit of documentation, and I am reasonably sure they call the same function.
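The function below leans on dedupe's peek helper; here is a minimal sketch of its behavior in case you want to run this standalone (the real implementation lives in dedupe.core):

import itertools

def peek(seq):
    # Return (first element, equivalent iterator), or (None, iterator)
    # when the input is already exhausted.
    seq = iter(seq)
    try:
        first = next(seq)
    except StopIteration:
        return None, seq
    return first, itertools.chain([first], seq)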
import itertools

import numpy


def scoreDuplicates2(record_pairs, data_model, classifier,
                     num_cores=1, threshold=0.5):
    # num_cores is kept for signature parity with scoreDuplicates;
    # this version does no multiprocessing.
    final_ids1 = numpy.array([])
    final_ids2 = numpy.array([])
    final_scores = numpy.array([])
    chunk_size = 100000
    total = 0

    while True:
        ids1 = []
        ids2 = []
        records = []

        chunk = itertools.islice(record_pairs, chunk_size)
        first, chunk = peek(chunk)  # renamed from `next` to avoid shadowing the builtin
        if first is not None:
            for record_pair in chunk:
                ((id_1, record_1, smaller_ids_1),
                 (id_2, record_2, smaller_ids_2)) = record_pair

                if set.isdisjoint(smaller_ids_1, smaller_ids_2):
                    ids1.append(id_1)
                    ids2.append(id_2)
                    records.append((record_1, record_2))

            if records:
                distances = data_model.distances(records)
                scores = classifier.predict_proba(distances)[:, -1]

                # Keep only the results that clear the threshold, in plain
                # numpy arrays, instead of building a record array per chunk.
                mask = scores > threshold
                final_ids1 = numpy.append(final_ids1, numpy.array(ids1)[mask])
                final_ids2 = numpy.append(final_ids2, numpy.array(ids2)[mask])
                final_scores = numpy.append(final_scores, scores[mask])

                # The per-chunk construction that leaked:
                # scored_pairs = numpy.core.records.fromarrays(
                #     (ids, scores),
                #     dtype=[('pairs', 'object', 2), ('score', 'f4', 1)])

            total += chunk_size
        else:
            # Input exhausted: build the record array exactly once.
            final_ids = []
            for i in range(len(final_ids1)):
                final_ids.append((final_ids1[i], final_ids2[i]))
            # Seemingly dumb list conversion; fromarrays was finicky
            # about the exact input format (see above).
            final_scores = list(final_scores)

            final_pairs = numpy.core.records.fromarrays(
                (final_ids, final_scores),
                dtype=[('pairs', 'object', 2), ('score', 'f4', 1)])
            return final_pairs
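For completeness, this is roughly how I exercised the replacement. candidate_pairs and linker are hypothetical here: candidate_pairs stands in for the iterator of record-pair tuples dedupe's blocking produces, and the .data_model / .classifier attribute names are my assumption about the dedupe 1.2.x internals rather than anything documented:

# Hypothetical call site; linker is a trained RecordLink instance.
scored = scoreDuplicates2(candidate_pairs,
                          linker.data_model,
                          linker.classifier,
                          threshold=0.5)

for (id_1, id_2), score in zip(scored['pairs'], scored['score']):
    print(id_1, id_2, score)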
Top GitHub Comments
Yes I could look at that
On Tue, May 3, 2016 at 2:15 PM, Forest Gregg notifications@github.com wrote:
@lucaswiser would you mind checking if this last commit fixes the memory leak you were seeing?