Upgrade duplicate finding logic with edit distance
We could set a similarity threshold (e.g. 0.5 or 0.7) above which a conference is flagged as a duplicate during import.
Issue Analytics
- State:
- Created 4 years ago
- Comments: 11 (6 by maintainers)

Thank you for picking this up @nishant-sethi 😃
You could add it as a utility function that the import CLI command picks up. Also, I would suggest using Levenshtein similarity and setting the threshold a bit higher, around 0.9.
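A rough sketch of what such a utility could look like, in case it helps. The function names (`levenshtein_distance`, `levenshtein_similarity`, `find_duplicates`) are hypothetical and not taken from the project. It also assumes similarity is normalized as `1 - distance / max(len(a), len(b))` and that names are compared case-insensitively; both are choices made for illustration.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance via dynamic programming (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical (assumed formula)."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


def find_duplicates(name, existing_names, threshold=0.9):
    """Return existing names whose similarity to `name` meets the threshold."""
    return [
        existing for existing in existing_names
        if levenshtein_similarity(name, existing) >= threshold
    ]
```

The import command could then call `find_duplicates(new_name, known_names)` and flag the conference when the returned list is not empty.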
We could use Levenshtein or difflib.SequenceMatcher. There is a Stack Overflow answer comparing the performance and trade-offs of both.
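For comparison, a minimal sketch of the `difflib.SequenceMatcher` option using only the standard library. The helper names `sequence_similarity` and `is_duplicate` are hypothetical. Note that `ratio()` is computed as `2 * M / T` (matched characters over total characters), which is not the same quantity as a normalized Levenshtein similarity, so a threshold such as 0.9 may need separate tuning.

```python
from difflib import SequenceMatcher


def sequence_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] from difflib; names are compared case-insensitively."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def is_duplicate(name: str, existing_names, threshold: float = 0.9) -> bool:
    """True if any existing name is at least `threshold` similar to `name`."""
    return any(
        sequence_similarity(name, existing) >= threshold
        for existing in existing_names
    )
```

This avoids an extra dependency, at the cost of a score that behaves a little differently from edit distance on short names.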