[Question] Recommended KNN/ANN index for large datasets
See original GitHub issueI would like to use CropResistantHash to quickly find near-duplicates from a large set of reference images.
With other hash functions I would normaly use some kind of approximate nearest neighbor index, such as NMSLib or Annoy. The challenge is that CropResistantHash is variable length and cannot be compared using one of the standard distance functions (Angular, Hamming, Manhattan, …).
Can anyone point me to an alternative solution? How do you use this with large datasets?
Issue Analytics
- State:
- Created 3 years ago
- Comments:5 (3 by maintainers)
Top Results From Across the Web
Indexing Very Large Tables - Towards Data Science
A short guide to the best practices around indexing large tables and how to use partitioning to ease the load on indexing.
Read more >postgresql - Indexing a large static dataset
Since your dataset is static, having tons and tons of indexes isn't a big problem. Each index has a cost for insert/update/delete.
Read more >PostgreSQL Index on JSON on Large Data Sets - Stack Overflow
This there a performance hit on creating a index on a jsonb column? Reason is I have a very large data set which...
Read more >Analyzing and Interpreting Large Datasets - CDC
For large datasets, analyze continuous variables (such as age) by determining the mean, median, standard deviation and interquartile range (IQR).
Read more >Intro to data structures — pandas 1.5.2 documentation
The fundamental behavior about data types, indexing, axis labeling, and alignment apply across all of the objects. To get started, import NumPy and...
Read more >
Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free
Top Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found

Some links that could be useful:
I can try that, but I believe that any distance function in a DB would rely on full sequential scan of the table. We are talking about hundreds of millions of rows here, so I think some kind of index is neccessary to narrow the options down a bit. But thank you for the idea, I’ll explore it.