Use difflib instead of python-Levenshtein for computing similarity ratio

The python-Levenshtein library is licensed under GPLv2, which means derived works must be made available under the same license. Because of this, I, and presumably others, cannot use this library if we want a different license for works that use the word_forms library as a dependency. (Thinking about this, it may mean that word_forms itself should also be GPLv2.)

While looking for alternatives, I chanced upon Python’s own difflib library and its SequenceMatcher.ratio() function. In my tests, the output of this ratio is exactly the same as the python-Levenshtein ratio. In fact, there is some overlap between the actual implementations of these libraries, as mentioned in the python-Levenshtein docs.

Code block to demonstrate this:

from difflib import SequenceMatcher
from Levenshtein import ratio

def sequence_matcher_ratio(a, b):
    return SequenceMatcher(None, a, b).ratio()

def compare_equality(a, b):
    print(sequence_matcher_ratio(a, b) == ratio(a, b))

def compare_print(a, b):
    print("Sequence Matcher Ratio: ", sequence_matcher_ratio(a, b))
    print("Levenshtein Ratio: ", ratio(a, b))

>>> compare_equality('continent', 'continence') 
True

>>> compare_print('continent', 'continence')
Sequence Matcher Ratio:  0.8421052631578947
Levenshtein Ratio:  0.8421052631578947
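
As a sanity check on the numbers above, the SequenceMatcher value can be reproduced by hand. The difflib documentation defines ratio() as 2.0*M/T, where M is the number of matched elements and T is the total length of both sequences. A minimal sketch, where manual_ratio is a hypothetical helper name and not part of either library:

```python
from difflib import SequenceMatcher


def manual_ratio(a, b):
    """Recompute SequenceMatcher.ratio() as 2*M/T (hypothetical helper)."""
    matcher = SequenceMatcher(None, a, b)
    # M: total size of all matching blocks (the final block is a size-0 sentinel).
    matches = sum(block.size for block in matcher.get_matching_blocks())
    # T: combined length of both sequences.
    total = len(a) + len(b)
    # difflib defines the ratio of two empty sequences as 1.0.
    return 2.0 * matches / total if total else 1.0


a, b = "continent", "continence"
print(manual_ratio(a, b))                   # 2 * 8 / 19
print(SequenceMatcher(None, a, b).ratio())  # same value
```

Here the only matching block is "continen" (8 characters) and the combined length is 9 + 10 = 19, which gives 16/19 ≈ 0.8421, in line with the output above.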

I propose we move to SequenceMatcher, or some other library, instead of python-Levenshtein.
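
To illustrate what a standard-library-only replacement might look like, here is a rough sketch. It assumes the ratio is used to pick the closest candidate word, which may not be how the project actually uses it; the similarity helper and the candidate list are hypothetical, while SequenceMatcher and get_close_matches are part of difflib.

```python
from difflib import SequenceMatcher, get_close_matches


def similarity(a, b):
    """Similarity in [0, 1] using only the standard library (hypothetical helper)."""
    return SequenceMatcher(None, a, b).ratio()


# Hypothetical candidate list, purely for illustration.
candidates = ["continence", "banana"]

# get_close_matches ranks candidates by SequenceMatcher ratio and drops
# anything scoring below `cutoff` (default 0.6).
print(get_close_matches("continent", candidates, n=1, cutoff=0.6))
print(similarity("continent", "continence"))
```

The difflib documentation notes that the result of ratio() can depend on the order of the arguments, so it is worth running any existing tests to confirm the two approaches agree on the inputs that matter.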

I’m already doing this in my own fork, so I can raise a PR for this if needed.

Issue Analytics

State:
Created 3 years ago
Comments: 6 (4 by maintainers)

Top GitHub Comments

gutfeeling commented, Nov 30, 2020

@sajal2692 Feel free to send the PR.

gutfeeling commented, Dec 11, 2020

Closing this as the PR is now merged.
