[Question] Recommended KNN/ANN index for large datasets

I would like to use CropResistantHash to quickly find near-duplicates from a large set of reference images.

With other hash functions I would normally use some kind of approximate nearest neighbor index, such as NMSLib or Annoy. The challenge is that a CropResistantHash is variable-length, so two hashes cannot be compared using one of the standard distance functions (Angular, Hamming, Manhattan, …).

Can anyone point me to an alternative solution? How do you use this with large datasets?
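
To make the comparison problem concrete, here is a minimal sketch of one way two variable-length hashes could be compared: treat each hash as a list of fixed-length segment hashes and count how many query segments have a close counterpart in the reference. This assumes each segment hash is available as a 64-bit integer; the function names and the `max_distance` threshold are hypothetical, and this is not the library's own implementation.

```python
from typing import List, Tuple


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer-encoded hashes."""
    return bin(a ^ b).count("1")


def count_matching_segments(
    query: List[int], reference: List[int], max_distance: int = 4
) -> Tuple[int, int]:
    """Compare two variable-length hashes given as lists of segment hashes.

    Returns (number of query segments with a reference segment within
    max_distance bits, sum of the best distances of those segments).
    The threshold of 4 bits is an arbitrary illustrative value.
    """
    matches = 0
    total_distance = 0
    for q in query:
        # Best (smallest) distance from this query segment to any reference segment.
        best = min((hamming(q, r) for r in reference), default=None)
        if best is not None and best <= max_distance:
            matches += 1
            total_distance += best
    return matches, total_distance
```

Because the score depends on the best pairing between two sets of segments rather than on a single fixed-length vector, it does not map directly onto the metrics that Annoy or NMSLib expect.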

Issue Analytics

Created 3 years ago
Comments:5 (3 by maintainers)

Top GitHub Comments

JohannesBuchner commented, Feb 2, 2021

Some links that could be useful:

misotrnka commented, Feb 2, 2021

I can try that, but I believe that any distance function in a DB would rely on a full sequential scan of the table. We are talking about hundreds of millions of rows here, so I think some kind of index is necessary to narrow the options down a bit. But thank you for the idea, I’ll explore it.
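
One way to get the kind of index described here, sketched under assumptions rather than taken from the thread: index every segment hash separately, split each into bands, and look up candidates by exact band match. By the pigeonhole principle, two 64-bit hashes that differ in at most 3 bits must agree exactly on at least one of 4 bands, so only those buckets need a Hamming check. The class and function names, the 64-bit segment size, and the band count are all hypothetical choices for illustration.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple

HASH_BITS = 64                     # assumed size of one segment hash
BANDS = 4                          # guarantees recall for distances < BANDS
BAND_BITS = HASH_BITS // BANDS


def bands(h: int) -> List[Tuple[int, int]]:
    """Split a hash into (band_number, band_value) keys."""
    mask = (1 << BAND_BITS) - 1
    return [(i, (h >> (i * BAND_BITS)) & mask) for i in range(BANDS)]


class SegmentIndex:
    """In-memory banded index over the segment hashes of many images."""

    def __init__(self) -> None:
        # (band_number, band_value) -> set of (image_id, segment_hash)
        self._buckets = defaultdict(set)

    def add(self, image_id: Hashable, segment_hashes: List[int]) -> None:
        for h in segment_hashes:
            for key in bands(h):
                self._buckets[key].add((image_id, h))

    def query(self, segment_hashes: List[int], max_distance: int = 3) -> Dict[Hashable, int]:
        """Return image_id -> number of query segments matched.

        Results are complete only when max_distance < BANDS; larger
        thresholds may miss true matches.
        """
        counts: Dict[Hashable, int] = defaultdict(int)
        for q in segment_hashes:
            matched_images = set()
            for key in bands(q):
                # Only candidates sharing this band exactly are checked.
                for image_id, h in self._buckets.get(key, ()):
                    if bin(q ^ h).count("1") <= max_distance:
                        matched_images.add(image_id)
            for image_id in matched_images:
                counts[image_id] += 1
        return dict(counts)
```

Images can then be ranked by how many segments matched. For hundreds of millions of rows the dictionary would not fit comfortably in memory, but the same banding idea should carry over to a database: store each band in its own column with an ordinary index, fetch rows by exact band match, and verify the Hamming distance only on that candidate set.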
