Store MinHashes for later queries

Hi, I am using MinHashLSH to query for duplicate documents. This is the exact flow of my use case:

  1. I have around 100K documents.
  2. When I open any document, it should show all other documents that are a 90% match.

For this, I first calculate a MinHash for each document and then add it to the LSH object below.

    MinHashLSH(num_perm=perms, threshold=0.9)

The issue is that creating these MinHashes takes around 470 seconds, so I can't redo it every time I query a new document.

So I am planning to store this LSH object on disk for future queries and update it perhaps once a day or once a week.

Has anyone tried this before, or does anyone know a better way to handle this type of use case?

Thank you.

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
aastafiev commented, Oct 17, 2018

To delete a document from a MinHashLSH index correctly, use the “remove” method of MinHashLSH. See the API documentation: https://ekzhu.github.io/datasketch/documentation.html#minhash-lsh

1 reaction
aastafiev commented, Oct 5, 2018
