[Question] Recommended KNN/ANN index for large datasets

I would like to use CropResistantHash to quickly find near-duplicates from a large set of reference images.

With other hash functions I would normally use some kind of approximate nearest neighbor index, such as NMSLib or Annoy. The challenge is that a CropResistantHash is variable-length and cannot be compared using one of the standard distance functions (Angular, Hamming, Manhattan, …).

Can anyone point me to an alternative solution? How do you use this with large datasets?
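
To make the difficulty concrete, here is a minimal sketch of how two variable-length hashes might be compared pairwise. It assumes each image is represented as a list of 64-bit segment hashes stored as plain integers, and that two images count as near-duplicates when enough segments are within a small Hamming distance of each other. The function names, the `max_bits` and `min_segments` parameters, and the threshold values are illustrative assumptions, not the library's API.

```python
from typing import List


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer-encoded hashes."""
    return bin(a ^ b).count("1")


def count_matching_segments(query: List[int], reference: List[int],
                            max_bits: int = 8) -> int:
    """Count query segments that have at least one close reference segment.

    Both lists may have different lengths, which is why a single fixed-size
    vector distance (Hamming, Angular, ...) does not apply directly.
    """
    return sum(
        1 for q in query
        if any(hamming(q, r) <= max_bits for r in reference)
    )


def is_near_duplicate(query: List[int], reference: List[int],
                      min_segments: int = 1, max_bits: int = 8) -> bool:
    """Assumed rule: near-duplicate if enough segments match."""
    return count_matching_segments(query, reference, max_bits) >= min_segments


# Hypothetical segment hashes: a cropped copy keeps one (slightly noisy) segment.
original = [0xFFFF0000FFFF0000, 0x123456789ABCDEF0, 0x0F0F0F0F0F0F0F0F]
cropped = [0x123456789ABCDEF1, 0xAAAAAAAAAAAAAAAA]
print(is_near_duplicate(cropped, original))
```

Each comparison costs on the order of len(query) × len(reference) Hamming checks, so applying it to every reference image is a full scan of the dataset.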

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

misotrnka commented, Feb 2, 2021

I can try that, but I believe that any distance function in a DB would rely on a full sequential scan of the table. We are talking about hundreds of millions of rows here, so I think some kind of index is necessary to narrow the options down a bit. But thank you for the idea, I’ll explore it.
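
One way such an index could work, sketched here as an assumption and not as something proposed in the thread: index every segment hash separately and use the pigeonhole principle to generate candidates. If each 64-bit segment hash is split into `max_bits + 1` bands, any two hashes within `max_bits` of each other must agree exactly on at least one band, so exact-match lookups on bands find all candidates without a sequential scan. The `SegmentIndex` class, its parameters, and the integer encoding of segment hashes are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple

HASH_BITS = 64  # assumed size of each segment hash


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer-encoded hashes."""
    return bin(a ^ b).count("1")


class SegmentIndex:
    """Banded exact-match index over individual segment hashes (a sketch)."""

    def __init__(self, max_bits: int = 3) -> None:
        self.max_bits = max_bits
        self.bands = max_bits + 1            # pigeonhole: one band must match
        self.band_bits = HASH_BITS // self.bands
        # (band number, band value) -> [(image id, full segment hash)]
        self.buckets: Dict[Tuple[int, int], List[Tuple[Hashable, int]]] = defaultdict(list)

    def _band_keys(self, h: int) -> List[Tuple[int, int]]:
        keys = []
        mask = (1 << self.band_bits) - 1
        for i in range(self.bands):
            shift = i * self.band_bits
            # The last band absorbs any leftover high bits.
            value = h >> shift if i == self.bands - 1 else (h >> shift) & mask
            keys.append((i, value))
        return keys

    def add(self, image_id: Hashable, segment_hashes: List[int]) -> None:
        for h in segment_hashes:
            for key in self._band_keys(h):
                self.buckets[key].append((image_id, h))

    def query(self, segment_hashes: List[int], min_segments: int = 1) -> Dict[Hashable, int]:
        """Return {image id: number of query segments matched}."""
        matched = defaultdict(set)
        for qi, q in enumerate(segment_hashes):
            for key in self._band_keys(q):
                for image_id, h in self.buckets.get(key, ()):
                    # Verify candidates with the full Hamming distance.
                    if hamming(q, h) <= self.max_bits:
                        matched[image_id].add(qi)
        return {img: len(s) for img, s in matched.items() if len(s) >= min_segments}


index = SegmentIndex(max_bits=3)
index.add("a", [0xFFFF0000FFFF0000, 0x123456789ABCDEF0])
index.add("b", [0x0F0F0F0F0F0F0F0F])
print(index.query([0x123456789ABCDEF1, 0xAAAAAAAAAAAAAAAA]))
```

An in-memory dictionary will not hold hundreds of millions of rows, but the same (band number, band value) keys could in principle be stored in a database table with an ordinary index, leaving only the few candidate rows to be verified with the full distance function.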
