SequenceMatcher is slower than python-Levenshtein
I was adapting this code for our private use and noticed that the
python-Levenshtein
package is roughly 100x faster than SequenceMatcher's ratio.
Levenshtein.ratio(A, B) gives the same result. I understand that this library is aimed more at offline benchmarking, but it doesn't hurt to be faster 😉 .
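The claim is easy to check with a small comparison. python-Levenshtein itself is a C extension (the drop-in call is `Levenshtein.ratio(a, b)`); as a hedged illustration, the sketch below reimplements in pure Python the similarity it computes (edit distance with substitution cost 2, normalised by the combined length) next to stdlib `difflib`:

```python
from difflib import SequenceMatcher

def levenshtein_ratio(a: str, b: str) -> float:
    """Pure-Python sketch of the similarity python-Levenshtein's
    ratio() computes: edit distance with substitution cost 2,
    normalised by the combined string length."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))          # row 0: j insertions, cost 1 each
    for i, ca in enumerate(a, 1):
        cur = [i]                           # column 0: i deletions
        for j, cb in enumerate(b, 1):
            sub = 0 if ca == cb else 2      # substitutions cost 2
            cur.append(min(prev[j] + 1,     # delete ca
                           cur[j - 1] + 1,  # insert cb
                           prev[j - 1] + sub))
        prev = cur
    lensum = len(a) + len(b)
    return (lensum - prev[-1]) / lensum

# On many inputs the two measures agree exactly:
print(levenshtein_ratio("kitten", "sitting"))              # ≈ 0.6154
print(SequenceMatcher(None, "kitten", "sitting").ratio())  # ≈ 0.6154
```

The real speedup comes from replacing `SequenceMatcher(None, a, b).ratio()` with the C-level `Levenshtein.ratio(a, b)`; note the two measures can diverge when SequenceMatcher's junk heuristics kick in, so it is worth spot-checking on real data.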
By the way, can you explain the rationale for the custom cost function for substitutions? Do you have an example of how using it changes the alignment path taken?
Issue Analytics
- State:
- Created 5 years ago
- Comments:7 (3 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Thanks for the reference. We noticed the speed improvements you had made at the time (thanks).
BTW, I have been cleaning up the code to make it easily usable as a library: making it pip-installable, adding Python 3 type annotations, etc. If you plan to publish this as a library on PyPI, I will send a pull request. Otherwise I would consider publishing it as a separate library.
On another side note, I want to understand whether homophones can be added as another static rule for substitutions, as they are among the easier classes of errors people make …
P.S. I think we can consider moving the Damerau-Levenshtein code to Cython for speed improvements, but that's a separate task in itself. I realized how slow pure Python is compared to Cython/C after looking at python-Levenshtein.
Aha, nice.
I was actually originally using the Damerau-Levenshtein code bundled with ERRANT (rdlextra.py) to do the character alignment, but found SequenceMatcher to be a lot faster. If python-Levenshtein is faster still, then that might definitely be something worth upgrading.
As for the rationale behind the custom substitution cost, it’s easiest to refer you to the alignment paper, particularly section 3.2 and Table 2.
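For readers without the paper to hand, the general idea can be sketched (this is a hedged illustration, not ERRANT's actual cost function): in a weighted Levenshtein alignment, lowering the substitution cost for similar tokens steers the backtrace toward one-to-one substitutions instead of delete+insert pairs. Here a hypothetical character-similarity cost plays that role:

```python
from difflib import SequenceMatcher

def sub_cost(a: str, b: str) -> float:
    # Hypothetical substitution cost: cheaper for orthographically
    # similar tokens, so "cat" -> "cats" beats a delete+insert pair.
    return 2.0 * (1.0 - SequenceMatcher(None, a, b).ratio())

def align(src, tgt, cost=sub_cost):
    """Weighted Levenshtein alignment returning (distance, op path),
    with ops M (match), S (substitute), D (delete), I (insert)."""
    n, m = len(src), len(tgt)
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    ops = [[""] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0], ops[i][0] = float(i), "D"
    for j in range(1, m + 1):
        dist[0][j], ops[0][j] = float(j), "I"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if src[i - 1] == tgt[j - 1]:
                diag = (dist[i - 1][j - 1], "M")          # match is free
            else:
                diag = (dist[i - 1][j - 1] + cost(src[i - 1], tgt[j - 1]), "S")
            dist[i][j], ops[i][j] = min(diag,
                                        (dist[i - 1][j] + 1.0, "D"),
                                        (dist[i][j - 1] + 1.0, "I"))
    # Backtrace the chosen operations.
    i, j, path = n, m, []
    while i > 0 or j > 0:
        op = ops[i][j]
        path.append(op)
        if op in ("M", "S"):
            i, j = i - 1, j - 1
        elif op == "D":
            i -= 1
        else:
            j -= 1
    return dist[n][m], path[::-1]

# A similarity-aware cost yields clean one-to-one substitutions:
print(align(["the", "cat", "sat"], ["the", "cats", "sit"])[1])  # ['M', 'S', 'S']
```

With a flat substitution cost of 2 (exactly a delete plus an insert), the two paths tie and the alignment can resolve either way; that ambiguity is what a tuned cost removes.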