Performance tests / thoughts (~10e6 hashes)
Firstly, congratulations on imagehash!
I’m using it for an application in which I have a few million ImageHash objects in a pandas DataFrame. I have a web server (a “microservice”) which loads all of these hashes into memory once (from a pickled file) and then returns the closest matches for a given hash.
That is, this server allows other services to call it like so: http://example.com/?phash=…, passing it the phash of an image (the needle), which is then compared against the millions of stored, pre-computed hashes.
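In sketch form, the lookup is just a linear scan (a sketch, not the actual service code; names like `df` and `request_phash` are illustrative):

```python
import imagehash

# The needle: a phash arriving via the query string (hex-encoded).
# 'request_phash' is a hypothetical stand-in for the decoded parameter.
needle = imagehash.hex_to_hash(request_phash)

# The haystack: a few million pre-computed ImageHash objects in a DataFrame
# column ('df' is illustrative). Every '-' below calls ImageHash.__sub__,
# i.e. one Hamming-distance computation.
distances = df['phash'].apply(lambda stored: needle - stored)
closest = distances.nsmallest(10)
```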
As these millions of hash comparisons were taking > 15 seconds on a pretty good machine, this made me look under the hood. Here’s what I uncovered using line_profiler:
```
Total time: 40.0547 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    66                                           @profile
    67                                           def __sub__(self, other):
    68   6585208    2452786.0      0.4      6.1      if other is None:
    69                                                   raise TypeError('Other hash must not be None.')
    70   6585208    4903316.0      0.7     12.2      if self.hash.size != other.hash.size:
    71                                                   raise TypeError('ImageHashes must be of the same shape.', self.hash.shape, other.hash.shape)
    72                                               # original code below, split up to profile each separate instruction
    73   6585208    8010861.0      1.2     20.0      flattened_h = self.hash.flatten()
    74   6585208    6635342.0      1.0     16.6      flattened_other_h = other.hash.flatten()
    75   6585208    6499394.0      1.0     16.2      sub_calc = flattened_h != flattened_other_h
    76   6585208    9317760.0      1.4     23.3      non_zero = numpy.count_nonzero(sub_calc)
    77   6585208    2235231.0      0.3      5.6      return non_zero
```
(the reported total time is slower than when running this code without the profiler, but the percentage values still hold)
Interestingly enough, the first two sanity `if` checks take up 18% of the time. First question: is it worth disabling those (but leaving users with more obscure error messages when `__sub__` is called with incompatible arguments)? Would it be worth considering having a separate, “optimized” version of `__sub__` that assumes that the user is passing correct values to it…?
The second, and most important, finding is that both `.flatten()` operations take up close to 40% of the running time.
I’ve modified my version of imagehash to pre-compute `self.hash_flat` once in `__init__`, and removed both sanity checks (see the sketch after the profile below). Here’s the optimized result:
```
Total time: 16.0691 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    67                                           @profile
    68                                           def __sub__(self, other):
    69                                               # if other is None:
    70                                               #     raise TypeError('Other hash must not be None.')
    71                                               # if self.hash.size != other.hash.size:
    72                                               #     raise TypeError('ImageHashes must be of the same shape.', self.hash.shape, other.hash.shape)
    73
    74                                               # optimized code
    75   6585208    7297545.0      1.1     45.4      sub_calc = self.hash_flat != other.hash_flat
    76   6585208    8771542.0      1.3     54.6      return numpy.count_nonzero(sub_calc)
```
Much better… but is this only better for my specific, “weird” application? 😄
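For reference, the whole patch boils down to something like this (a sketch of the two changes, not the library’s actual source):

```python
import numpy

class ImageHash:
    # Sketch only: the real constructor and class carry more than this.
    def __init__(self, binary_array):
        self.hash = binary_array
        # Pre-compute the flattened view once instead of on every comparison.
        self.hash_flat = binary_array.flatten()

    def __sub__(self, other):
        # Sanity checks removed: assumes 'other' is a compatible ImageHash.
        return numpy.count_nonzero(self.hash_flat != other.hash_flat)
```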
I have some other thoughts/questions around `numpy.count_nonzero`, but perhaps we can save those for later/another issue.
Thanks again! Looking forward to reading your thoughts.
Good! Yes, that performance is closer to what I expect.
If you do not need the distance, but only whether the two match exactly, you can make it even faster by converting the hash to a suitably large integer and using either a hashtable or a database.
If you search for database + Hamming distance, relevant material also comes up.
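A minimal sketch of that exact-match route, assuming a small helper that packs the boolean hash array into an integer (`records` below is an illustrative iterable, not part of imagehash):

```python
def hash_to_int(image_hash):
    """Pack the hash's flattened boolean array into one Python integer."""
    value = 0
    for bit in image_hash.hash.flatten():
        value = (value << 1) | int(bit)
    return value

# Build the exact-match index once; lookups are then O(1) on average.
# 'records' stands in for whatever iterable of (ImageHash, metadata) you have.
index = {hash_to_int(h): meta for h, meta in records}

def exact_lookup(needle):
    # Returns the stored metadata on an exact hash match, else None.
    return index.get(hash_to_int(needle))
```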
Thank you very very very much, this worked perfectly! Here’s the working code for posterity:
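(In outline, the approach is one vectorized comparison over a stacked 2D boolean array; this is a sketch with illustrative names like `hash_series`, not the exact snippet:)

```python
import numpy

# Stack every pre-computed flat hash into one 2D boolean array, one row per
# image. 'hash_series' is a stand-in; iterating the pandas Series yields the
# ImageHash objects directly.
haystack = numpy.stack([h.hash_flat for h in hash_series])

def distances(needle):
    # Hamming distance of the needle against every stored hash in one pass,
    # via broadcasting and the (newer) axis argument of count_nonzero.
    return numpy.count_nonzero(haystack != needle.hash_flat, axis=1)
```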
Changing from the Pandas Series to the NumPy array, as you suggested, was actually necessary in order for the code above to work.
I also found out that the `axis` argument to `numpy.count_nonzero` was added relatively recently. In the older version I was running locally, passing `axis` did not do anything, nor did numpy warn me that the argument was being silently dropped… In any case, this is great! Thanks again.
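(A quick guard against that pitfall, assuming NumPy 1.12 is indeed where `axis` was introduced:)

```python
import numpy

# count_nonzero only gained the 'axis' keyword in NumPy 1.12 (an assumption
# worth verifying); fail loudly rather than searching with a dropped keyword.
major, minor = (int(part) for part in numpy.__version__.split('.')[:2])
assert (major, minor) >= (1, 12), \
    "numpy.count_nonzero(..., axis=...) needs NumPy >= 1.12"
```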
My ~3M matches are taking closer to 2 seconds now. 😄