
Performance tests / thoughts (~10e6 hashes)

See original GitHub issue

Firstly, congratulations on imagehash!

I’m using it for an application in which I have a few million ImageHash objects in a pandas DataFrame. I have a web server (a “microservice”) which loads all of these hashes in memory once (from a pickled file), and then outputs closest matches for a given hash.

That is, other services call it via http://example.com/?phash=…, passing the phash of an image (… the needle …), which is then compared against the millions of stored/pre-computed hashes.

Since these millions of hash comparisons were taking > 15 seconds on a pretty good machine, I looked under the hood. Here’s what I uncovered using line_profiler:

Total time: 40.0547 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    66                                            @profile
    67                                            def __sub__(self, other):
    68   6585208    2452786.0      0.4      6.1     if other is None:
    69                                                raise TypeError('Other hash must not be None.')
    70   6585208    4903316.0      0.7     12.2     if self.hash.size != other.hash.size:
    71                                                raise TypeError('ImageHashes must be of the same shape.', self.hash.shape, other.hash.shape)
    72                                              # original code below, split up to profile each separate instruction
    73   6585208    8010861.0      1.2     20.0     flattened_h = self.hash.flatten()
    74   6585208    6635342.0      1.0     16.6     flattened_other_h = other.hash.flatten()
    75   6585208    6499394.0      1.0     16.2     sub_calc = flattened_h != flattened_other_h
    76   6585208    9317760.0      1.4     23.3     non_zero = numpy.count_nonzero(sub_calc)
    77   6585208    2235231.0      0.3      5.6     return non_zero

(the reported total time is slower than when running this code without the profiler, but the percentage values still hold)

Interestingly, the first two sanity checks alone take up 18% of the time. First question: is it worth disabling them (at the cost of more obscure error messages when __sub__ is called with incompatible arguments)? Or would it be worth offering a separate, “optimized” version of __sub__ that assumes the caller is passing correct values…?

The second, and most important finding, is that both .flatten() operations take up close to 40% of the running time.
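As an aside, that cost is explained by the fact that `numpy.ndarray.flatten` always allocates and returns a copy, whereas `ravel` returns a view when the memory layout allows it. A small illustration:

```python
import numpy

a = numpy.zeros((8, 8), dtype=bool)

flat_copy = a.flatten()  # always allocates a new array
flat_view = a.ravel()    # returns a view when the data is contiguous

flat_view[0] = True      # mutating the view mutates the original array...
assert a[0, 0]           # ...because they share memory
assert not flat_copy[0]  # the copy is unaffected
```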

I’ve modified my version of imagehash to pre-compute self.hash_flat once in __init__, and removed both sanity checks. Here’s the optimized result:

Total time: 16.0691 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    67                                            @profile
    68                                            def __sub__(self, other):
    69                                              # if other is None:
    70                                              #   raise TypeError('Other hash must not be None.')
    71                                              # if self.hash.size != other.hash.size:
    72                                              #   raise TypeError('ImageHashes must be of the same shape.', self.hash.shape, other.hash.shape)
    73                                           
    74                                              # optimized code
    75   6585208    7297545.0      1.1     45.4     sub_calc = self.hash_flat != other.hash_flat
    76   6585208    8771542.0      1.3     54.6     return numpy.count_nonzero(sub_calc)

Much better… but is this only better for my specific, “weird” application? 😄
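For clarity, the modification described above can be sketched roughly like this (`FastHash` is a hypothetical stand-in for illustration, not the library’s actual class):

```python
import numpy

class FastHash:
    """Hypothetical stand-in for imagehash.ImageHash that precomputes
    the flattened boolean array once, so __sub__ skips .flatten()."""

    def __init__(self, hash_array):
        self.hash = hash_array
        self.hash_flat = hash_array.flatten()  # computed once, reused on every comparison

    def __sub__(self, other):
        # sanity checks dropped for speed, as in the profiled version above
        return numpy.count_nonzero(self.hash_flat != other.hash_flat)

a = FastHash(numpy.array([[True, False], [True, True]]))
b = FastHash(numpy.array([[True, True], [False, True]]))
print(a - b)  # Hamming distance → 2
```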

I have some other thoughts/questions around numpy.count_nonzero, but perhaps we can save those for later/another issue.

Thanks again! Looking forward to reading your thoughts.

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments:5 (3 by maintainers)

Top GitHub Comments

2 reactions
JohannesBuchner commented, Feb 8, 2018

Good! Yes, that performance is closer to what I expect.

If you do not need the distance but only if the two match exactly, you can make it even faster by converting the hash to a suitably large integer and using either a hashtable or database.

If you search for “database + hamming distance”, other relevant approaches come up as well.
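A minimal sketch of the exact-match idea above (all names are hypothetical): pack the boolean hash into a single integer and use an ordinary dict as the hashtable.

```python
import numpy

def hash_to_int(h):
    """Pack a flattened boolean hash into one Python int (hypothetical helper)."""
    value = 0
    for bit in h.flatten():
        value = (value << 1) | int(bit)
    return value

# exact-match index: integer key -> stored record
stored = {
    hash_to_int(numpy.array([True, False, True, True])): "image_a.jpg",
}

needle = numpy.array([True, False, True, True])
match = stored.get(hash_to_int(needle))  # O(1) lookup instead of a linear scan
print(match)  # → image_a.jpg
```

This only finds exact matches; for nearest-neighbor queries within a Hamming distance the vectorized comparison is still needed.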

0 reactions
gregsadetsky commented, Feb 8, 2018

Thank you very very very much, this worked perfectly! Here’s the working code for posterity:

import os
import pandas as pd
from PIL import Image
import imagehash
import numpy

DIR = '/Users/greg/Desktop'

# pre-compute the flattened hash of every stored image
hashes = []
for filename in ['a.jpg', 'b.jpg', 'c.jpg']:
    with open(os.path.join(DIR, filename), 'rb') as f:  # images must be opened in binary mode
        hashes.append(imagehash.whash(Image.open(f)).hash.flatten())

# stack into a single 2-D boolean array so one comparison covers all rows
hashes = numpy.array(hashes)

with open(os.path.join(DIR, 'needle.jpg'), 'rb') as f:
    needle = imagehash.whash(Image.open(f)).hash.flatten()

# Hamming distance of the needle against every stored hash at once
print(numpy.count_nonzero(needle != hashes, axis=1))

Changing from the Pandas Series to the NumPy array, as you suggested, was actually necessary in order for the code above to work.

I also found out that the axis argument to numpy.count_nonzero was added relatively recently. In the older version I was running locally, passing axis did not do anything, nor did numpy warn me that the argument was being silently dropped…
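For older NumPy versions that predate the axis argument to count_nonzero, an equivalent fallback is to sum the boolean differences along the rows, since True counts as 1:

```python
import numpy

hashes = numpy.array([[True, False, True],
                      [False, False, True],
                      [True, True, True]])
needle = numpy.array([True, False, False])

# equivalent to numpy.count_nonzero(needle != hashes, axis=1),
# but works on NumPy versions without count_nonzero's axis argument
distances = (needle != hashes).sum(axis=1)
print(distances)  # → [1 2 2]
```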

In any case, this is great! Thanks again.

My ~3M matches are taking closer to 2 seconds now. 😄
