Upgrade duplicate finding logic with edit distance


We could set a similarity threshold (0.5 or 0.7) above which a conference is flagged as a duplicate during import.

Issue Analytics

State:
Created 4 years ago
Comments: 11 (6 by maintainers)

Top GitHub Comments

1 reaction

Sangarshanan commented, Oct 29, 2019

Thank you for picking this up @nishant-sethi 😃

You could add it as a utility function that the import CLI command can pick up. I would also suggest using Levenshtein similarity and setting the threshold a bit higher, around 0.9.
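A minimal sketch of what such a utility might look like, assuming similarity is defined as 1 minus the edit distance divided by the longer string's length. The function names (`levenshtein_distance`, `levenshtein_similarity`, `is_probable_duplicate`) and the case-insensitive comparison are illustrative assumptions, not code from the project:

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping only two rows."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Normalize the distance to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


def is_probable_duplicate(name: str, existing_names, threshold: float = 0.9) -> bool:
    """Hypothetical helper: flag `name` if it is close to any existing name."""
    candidate = name.strip().lower()
    return any(
        levenshtein_similarity(candidate, other.strip().lower()) >= threshold
        for other in existing_names
    )
```

With this definition, "PyCon India 2019" and "PyCon India 2020" score 0.875, so a 0.9 threshold keeps them apart while 0.7 would flag them as duplicates. That may be one reason to prefer the higher value.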

1 reaction

Sangarshanan commented, Oct 28, 2019

We could use Levenshtein or difflib.SequenceMatcher. There is a Stack Overflow answer comparing the performance and trade-offs of both.
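For comparison, here is a rough sketch of the standard-library option. `difflib.SequenceMatcher.ratio()` returns 2*M/T, where M is the number of matched characters and T is the combined length of both strings, so its scores are not interchangeable with a normalized edit distance and a threshold would need its own tuning. The wrapper name `sequence_similarity` and the lowercasing are assumptions made for illustration:

```python
from difflib import SequenceMatcher


def sequence_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] based on matching blocks, ignoring case."""
    # The first argument is the "junk" predicate; None disables it.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


if __name__ == "__main__":
    pairs = [
        ("PyCon India 2019", "Pycon India 2019"),
        ("PyCon India 2019", "PyCon India 2020"),
        ("PyCon India 2019", "JSConf EU 2019"),
    ]
    for left, right in pairs:
        print(f"{left!r} vs {right!r}: {sequence_similarity(left, right):.3f}")
```

No extra dependency is needed for this variant, which may matter for a small CLI tool.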


Top Results From Across the Web

Identifying Duplicate Records with Fuzzy Matching - Mawazo

Attribute Distance Threshold For example, if the distance between the name fields for two customer records is significant enough, we can ...

Check if edit distance between two strings is one

A Simple Solution is to find Edit Distance using Dynamic programming. If distance is 1, then return true, else return false.

What are Fuzzy Matches in Salesforce Deduplication?

Finding duplicates involves comparing and evaluating a set of field values on a 'record A' and a 'record B'. This matching involves two...

Is there an efficient algorithm for fuzzy deduplication of string ...

If your measure of similarity is strong (e.g. Levenshtein distance 1), then you can process your string list in order, generating all possible...

How to speed up process of finding duplicates/similar items in ...

Edit distance is definitely not the way to proceed. The standard approach toward near-identical document deduplication is to compute hashes ...
