Use difflib instead of python-Levenshtein for computing similarity ratio

The python-Levenshtein library is licensed under GPLv2, meaning that derived works must be made available under the same license. Because of this, I, and presumably others, cannot use it if we want a different license for works that use the word_forms library as a dependency. (Thinking about this, it may mean that word_forms itself should also be GPLv2.)

While looking for alternatives, I chanced upon Python’s own difflib library and its SequenceMatcher.ratio() function. For the inputs I tried, the output of this ratio is exactly the same as the python-Levenshtein ratio. In fact, there is some overlap between the actual implementations of these libraries, as mentioned in the python-Levenshtein docs.

Code block to demonstrate this:

from difflib import SequenceMatcher
from Levenshtein import ratio

def sequence_matcher_ratio(a, b):
    return SequenceMatcher(None, a, b).ratio()

def compare_equality(a, b):
    print(sequence_matcher_ratio(a, b) == ratio(a, b))

def compare_print(a, b):
    print("Sequence Matcher Ratio: ", sequence_matcher_ratio(a, b))
    print("Levenshtein Ratio: ", ratio(a, b))

>>> compare_equality('continent', 'continence') 
True

>>> compare_print('continent', 'continence')
Sequence Matcher Ratio:  0.8421052631578947
Levenshtein Ratio:  0.8421052631578947
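
For context on where the 0.8421... figure comes from: the difflib documentation defines ratio() as 2.0*M/T, where M is the number of matched elements and T is the combined length of both sequences. The sketch below recomputes that value from get_matching_blocks(). The helper name ratio_from_matching_blocks is made up for illustration and is not part of difflib or word_forms; it only covers the difflib side and says nothing about how python-Levenshtein computes its ratio.

```python
from difflib import SequenceMatcher


def ratio_from_matching_blocks(a, b):
    """Recompute SequenceMatcher.ratio() as 2.0 * M / T (hypothetical helper)."""
    matcher = SequenceMatcher(None, a, b)
    # M: total size of all matching blocks (the trailing dummy block has size 0).
    matches = sum(block.size for block in matcher.get_matching_blocks())
    # T: combined length of both sequences.
    total = len(a) + len(b)
    # difflib returns 1.0 when both sequences are empty.
    return 2.0 * matches / total if total else 1.0


a, b = "continent", "continence"
manual = ratio_from_matching_blocks(a, b)
builtin = SequenceMatcher(None, a, b).ratio()
print(manual, builtin)
```

Here the only matching block is 'continen' (8 characters), and the two strings have 9 and 10 characters, so the ratio works out to 2 * 8 / 19.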

I propose we switch to SequenceMatcher, or some other library, instead of python-Levenshtein.

I’m already doing this in my own fork, so I can raise a PR for this if needed.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

gutfeeling commented, Nov 30, 2020 (1 reaction)

@sajal2692 Feel free to send the PR.

gutfeeling commented, Dec 11, 2020 (0 reactions)

Closing this as the PR is now merged.
