question-mark
Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Easy computation of all duplicates?

See original GitHub issue

Thanks for this library!

One question: Is there an easy way to get all pairs from a LSH? In other words, instead of a query for a single minhash, I need all pairs of records inside a same bucket.

I tried to dive into the code, but I didn’t understand very well how hashranges/hashtables work. What I need is similar to candidate_duplicates of this other implementation: https://mattilyra.github.io/2017/05/23/document-deduplication-with-lsh.html

Issue Analytics

  • State:closed
  • Created 5 years ago
  • Comments:14 (7 by maintainers)

github_iconTop GitHub Comments

4reactions
fjsjcommented, Nov 6, 2019

@sameertikoo that’s called the transitivity problem on Record Linkage. Check those slides, page 32. Quick solution is to make the pairs a graph and compute the connected components. Use networkx for that (or networkit if your graph has millions of edges). For an introduction on Record Linkage and the Transitivity problem, I have a talk on deduplication with Python.

2reactions
ekzhucommented, Nov 7, 2019

@sameertikoo the expert has spoken

Read more comments on GitHub >

github_iconTop Results From Across the Web

Find and remove duplicates - Microsoft Support
Find and remove duplicates · Select the cells you want to check for duplicates. · Click Home > Conditional Formatting > Highlight Cells...
Read more >
How to find and remove duplicates in Google Sheets - Ablebits
No duplicates, no extra calculations. There are just unique records sorted out in one table. Remove duplicates — standard data cleanup tool.
Read more >
3 EASY Ways to Find and Remove Duplicates in Excel
Join 300000+ professionals in our courses: https://www.xelplus.com/courses/These are 3 easy ways to remove duplicates in your data to create ...
Read more >
Count Unique or Duplicate Values in a List - YouTube
Transcript · 3 EASY Ways to Find and Remove Duplicates in Excel · 7 Ways to Use Vlookup in Excel.
Read more >
3 Best Methods to Find Duplicates in Excel - Simon Sez IT
You can use any number or logical operator (<,>, etc) here that you prefer. In this case, the COUNTIF formula returns the number...
Read more >

github_iconTop Related Medium Post

No results found

github_iconTop Related StackOverflow Question

No results found

github_iconTroubleshoot Live Code

Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free

github_iconTop Related Reddit Thread

No results found

github_iconTop Related Hackernoon Post

No results found

github_iconTop Related Tweet

No results found

github_iconTop Related Dev.to Post

No results found

github_iconTop Related Hashnode Post

No results found