Upgrade duplicate finding logic with edit distance

We could set a similarity threshold (0.5 or 0.7) above which a conference is flagged as a duplicate during import.
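
One way to read the threshold, stated here as an assumption since the issue does not define the score, is as a normalized edit-distance similarity between two conference names $a$ and $b$, where $\mathrm{lev}$ is the Levenshtein distance and $\tau$ is the chosen threshold:

```latex
\mathrm{sim}(a, b) = 1 - \frac{\mathrm{lev}(a, b)}{\max(|a|, |b|)},
\qquad
\text{flag as duplicate if } \mathrm{sim}(a, b) \ge \tau
```

Under this reading the score lies in $[0, 1]$, with 1 meaning identical strings, so a higher $\tau$ flags fewer pairs.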

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 11 (6 by maintainers)

Top GitHub Comments

1 reaction
Sangarshanan commented, Oct 29, 2019

Thank you for picking this up @nishant-sethi 😃

You could add it as a utility function that the import CLI command can call. I would also suggest using Levenshtein similarity and setting the threshold a bit higher, around 0.9.
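
A minimal sketch of what such a utility might look like, assuming the score is a normalized Levenshtein similarity and that names are compared case-insensitively. The function names (`levenshtein`, `similarity`, `find_possible_duplicates`) and the normalization are illustrative assumptions, not code from the project:

```python
from typing import Iterable, List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with a two-row dynamic programming table."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means identical after normalization."""
    # Assumption: trimming and lowercasing is enough normalization here.
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def find_possible_duplicates(
    name: str, existing: Iterable[str], threshold: float = 0.9
) -> List[Tuple[str, float]]:
    """Return existing names whose similarity to `name` meets the threshold."""
    scored = [(other, similarity(name, other)) for other in existing]
    flagged = [pair for pair in scored if pair[1] >= threshold]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)
```

An import command could call `find_possible_duplicates(new_name, known_names)` before inserting a record and skip or warn when the list is non-empty. Note that names differing only in the year can score above 0.9 with this measure, so comparing the year or dates separately may be worth considering.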

1 reaction
Sangarshanan commented, Oct 28, 2019

We could use Levenshtein distance or difflib.SequenceMatcher. There is a Stack Overflow answer comparing the performance and tradeoffs of the two.
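
For comparison, here is a small sketch of the `difflib.SequenceMatcher` option from the standard library. Its `ratio()` is based on matching blocks rather than edit distance, so scores are not interchangeable with a Levenshtein-based similarity. The helper names and the 0.9 default are assumptions for illustration:

```python
from difflib import SequenceMatcher


def sequence_ratio(a: str, b: str) -> float:
    """Similarity in [0, 1] computed as 2 * matches / total length."""
    # Assumption: trimming and lowercasing is enough normalization here.
    a, b = a.strip().lower(), b.strip().lower()
    # autojunk=False turns off the popularity heuristic, which only
    # applies to long sequences and is not needed for short names.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def is_probable_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    """Flag a pair of names whose ratio meets the threshold."""
    return sequence_ratio(a, b) >= threshold
```

Because the two measures score the same pair differently, a threshold tuned for one should be re-checked against sample data before being reused for the other.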

Top Results From Across the Web

Identifying Duplicate Records with Fuzzy Matching - Mawazo
Attribute Distance Threshold. For example, if the distance between the name fields for two customer records is significant enough, we can ...

Check if edit distance between two strings is one
A Simple Solution is to find Edit Distance using Dynamic programming. If distance is 1, then return true, else return false.

What are Fuzzy Matches in Salesforce Deduplication?
Finding duplicates involves comparing and evaluating a set of field values on a 'record A' and a 'record B'. This matching involves two...

Is there an efficient algorithm for fuzzy deduplication of string ...
If your measure of similarity is strong (e.g. Levenshtein distance 1), then you can process your string list in order, generating all possible...

How to speed up process of finding duplicates/similar items in ...
Edit distance is definitely not the way to proceed. The standard approach toward near-identical document deduplication is to compute hashes ...