Store MinHashes for later queries
Hi, I am using MinHashLSH to find duplicate documents. This is the exact flow of my use case:
- I have around 100K documents.
- When I open any document, it should show all other documents that are a 90% match.
For this, I first compute a MinHash for each document and then add it to the LSH object below.
MinHashLSH(num_perm=perms,threshold=0.9)
The issue is that creating these MinHashes takes around 470 seconds on its own, so I can't recreate them every time I query a new document.
I am therefore planning to store this LSH object on disk for future queries and update it perhaps once a day or once a week.
Has anyone tried this before, or does anyone know a better way to handle this type of use case?
Thank you.
Issue Analytics
- Created 5 years ago
- Comments:6 (3 by maintainers)
Top GitHub Comments
To delete a document from a MinHashLSH index correctly, use the “remove” method of MinHashLSH. See the API documentation: https://ekzhu.github.io/datasketch/documentation.html#minhash-lsh
Hi,
You should use pickling. See https://ekzhu.github.io/datasketch/lsh.html#connecting-to-existing-minhash-lsh