Use difflib instead of python-Levenshtein for computing similarity ratio
The python-Levenshtein library has a GPLv2 license, meaning that derived works must be made available under the same license. Because of this, I, and presumably others, cannot use this library if we want a different license for works that use the word_forms library as a dependency. (Thinking about this, it may mean that word_forms itself should also be GPLv2.)
While looking for alternatives, I came across Python’s own difflib library and its SequenceMatcher.ratio() function. For the inputs I tested, the output of this ratio is exactly the same as the python-Levenshtein ratio. In fact, there is some overlap between the implementations of these libraries, as mentioned in the python-Levenshtein docs.
Code block to demonstrate this:
from difflib import SequenceMatcher
from Levenshtein import ratio

def sequence_matcher_ratio(a, b):
    return SequenceMatcher(None, a, b).ratio()

def compare_equality(a, b):
    print(sequence_matcher_ratio(a, b) == ratio(a, b))

def compare_print(a, b):
    print("Sequence Matcher Ratio: ", sequence_matcher_ratio(a, b))
    print("Levenshtein Ratio: ", ratio(a, b))

>>> compare_equality('continent', 'continence')
True
>>> compare_print('continent', 'continence')
Sequence Matcher Ratio: 0.8421052631578947
Levenshtein Ratio: 0.8421052631578947
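
For reference, the difflib documentation defines ratio() as 2.0*M/T, where M is the number of matched elements and T is the combined length of both sequences. The sketch below recomputes the value for the example pair by hand using only the standard library. The helper names matched_characters and ratio_from_blocks are hypothetical and not part of difflib or word_forms.

```python
from difflib import SequenceMatcher


def matched_characters(a, b):
    """Sum the sizes of the matching blocks SequenceMatcher finds (M)."""
    matcher = SequenceMatcher(None, a, b)
    return sum(block.size for block in matcher.get_matching_blocks())


def ratio_from_blocks(a, b):
    """Recompute the similarity ratio as 2.0 * M / T (hypothetical helper)."""
    total = len(a) + len(b)  # T: combined length of both strings
    if total == 0:
        return 1.0  # difflib treats two empty sequences as identical
    return 2.0 * matched_characters(a, b) / total


if __name__ == "__main__":
    a, b = "continent", "continence"
    # "continen" is the only matching block, so M = 8 and T = 9 + 10 = 19.
    print(matched_characters(a, b))
    print(ratio_from_blocks(a, b))
    print(ratio_from_blocks(a, b) == SequenceMatcher(None, a, b).ratio())
```

Here 2 * 8 / 19 gives the 0.8421052631578947 shown in the output above.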
I propose we move to SequenceMatcher, or some other library, instead of the python-Levenshtein library.
I’m already doing this in my own fork, so I can raise a PR for this if needed.
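
Since the two ratios come from different libraries, it may be worth checking them against each other on a larger word list before switching. Below is a minimal sketch of such a check. The names find_disagreements and tolerance and the sample pairs are hypothetical, and python-Levenshtein is imported only if it happens to be installed.

```python
from difflib import SequenceMatcher


def sequence_matcher_ratio(a, b):
    """Similarity ratio computed with the standard library only."""
    return SequenceMatcher(None, a, b).ratio()


def find_disagreements(pairs, reference_ratio, tolerance=1e-12):
    """Return (a, b, expected, actual) for pairs where the ratios differ.

    reference_ratio is any callable taking two strings, for example
    Levenshtein.ratio. This is a hypothetical helper for illustration.
    """
    disagreements = []
    for a, b in pairs:
        expected = reference_ratio(a, b)
        actual = sequence_matcher_ratio(a, b)
        if abs(expected - actual) > tolerance:
            disagreements.append((a, b, expected, actual))
    return disagreements


if __name__ == "__main__":
    # Sample pairs chosen for illustration; substitute your own vocabulary.
    sample_pairs = [("continent", "continence"), ("run", "running")]
    try:
        from Levenshtein import ratio as levenshtein_ratio
    except ImportError:
        print("python-Levenshtein is not installed; nothing to compare.")
    else:
        for row in find_disagreements(sample_pairs, levenshtein_ratio):
            print(row)
```

An empty result on a representative word list would suggest the swap is safe for that vocabulary. Any rows printed show the pairs that need a closer look.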
Issue Analytics
- State:
- Created 3 years ago
- Comments: 6 (4 by maintainers)
@sajal2692 Feel free to send the PR.
Closing this as the PR is now merged.