SequenceLib is slower than python-Levenshtein

See original GitHub issue

I was adapting this code for our private use. I noticed that the python-Levenshtein package is about 100x faster than SequenceMatcher's ratio().

Levenshtein.ratio(A, B) gets you the same result. I understand that this library is more for offline benchmarking use, but it doesn’t hurt to be faster 😉 .
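A minimal sketch of how the two calls could be compared side by side. It assumes the third-party `Levenshtein` module (from the python-Levenshtein package) may or may not be installed, so the import is optional; the helper names `sequence_matcher_ratio`, `time_call`, and `compare` are hypothetical and not part of either library. Timings will vary by machine and input, and the two ratios may not agree on every input because SequenceMatcher uses its own matching heuristic.

```python
import timeit
from difflib import SequenceMatcher

try:
    import Levenshtein  # third-party package; optional in this sketch
except ImportError:
    Levenshtein = None


def sequence_matcher_ratio(a, b):
    """Similarity ratio from the standard library's SequenceMatcher."""
    return SequenceMatcher(None, a, b).ratio()


def time_call(fn, a, b, number=1000):
    """Seconds taken to call fn(a, b) `number` times."""
    return timeit.timeit(lambda: fn(a, b), number=number)


def compare(a, b, number=1000):
    """Return {name: (ratio, seconds)} for each implementation that is available."""
    impls = {"SequenceMatcher": sequence_matcher_ratio}
    if Levenshtein is not None:
        impls["Levenshtein"] = Levenshtein.ratio
    return {
        name: (fn(a, b), time_call(fn, a, b, number))
        for name, fn in impls.items()
    }


if __name__ == "__main__":
    for name, (ratio, seconds) in compare("kitten", "sitting").items():
        print(f"{name}: ratio={ratio:.4f}, time={seconds:.4f}s")
```

Running it on a sample of your own sentence pairs is the safest way to check both the speedup and whether the ratios match closely enough for your use.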

By the way, can you explain the rationale for the custom cost function for substitutions? Is there an example of how using it changes the alignment path that is taken?

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

sai-prasanna commented, Mar 9, 2019

Thanks for the reference. We noticed the speed improvements you made at that time (thanks).

BTW, I have been cleaning up the code to make it easily usable as a library: making it pip-installable, adding Python 3 type annotations, etc. If you plan to publish this as a library on PyPI, I will open a pull request. Otherwise, I would consider publishing it as a separate library.

On another side note, I would like to understand whether homophones can be added as another static rule for substitutions, since they are an easy class of errors for people to make …

P.S. I think we can consider moving the Damerau-Levenshtein code to Cython for speed improvements, but that's a separate task in itself. Looking at python-Levenshtein, I realized how slow Python is compared to Cython.

chrisjbryant commented, Mar 7, 2019

Aha nice.

I was originally using the Damerau-Levenshtein code bundled with ERRANT (rdlextra.py) to do the character alignment, but found SequenceMatcher to be a lot faster. If python-Levenshtein is faster still, then that is definitely something worth upgrading to.

As for the rationale behind the custom substitution cost, it’s easiest to refer you to the alignment paper, particularly section 3.2 and Table 2.
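To illustrate the general idea of how a substitution cost can change the alignment path, here is a rough sketch of a token-level edit-distance alignment with a pluggable cost. This is not ERRANT's cost function (see the paper for that); `align`, `uniform_cost`, and `char_similarity_cost` are hypothetical names, and the character-similarity cost is only a stand-in chosen for illustration.

```python
from difflib import SequenceMatcher


def uniform_cost(a, b):
    """Every substitution costs the same as one insertion or deletion."""
    return 1.0


def char_similarity_cost(a, b):
    """Illustrative stand-in: similar-looking tokens are cheaper to substitute."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def align(src, tgt, sub_cost):
    """Return (total cost, op string) for a minimum-cost alignment of two token lists.

    Ops: M = match, S = substitute, D = delete from src, I = insert from tgt.
    Ties are broken in the order diagonal, delete, insert.
    """
    n, m = len(src), len(tgt)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0], back[i][0] = i, "D"
    for j in range(1, m + 1):
        cost[0][j], back[0][j] = j, "I"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if src[i - 1] == tgt[j - 1]:
                diag = (cost[i - 1][j - 1], "M")
            else:
                diag = (cost[i - 1][j - 1] + sub_cost(src[i - 1], tgt[j - 1]), "S")
            candidates = [diag, (cost[i - 1][j] + 1, "D"), (cost[i][j - 1] + 1, "I")]
            cost[i][j], back[i][j] = min(candidates, key=lambda c: c[0])
    # Walk the back-pointers from the bottom-right corner to recover the path.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        op = back[i][j]
        ops.append(op)
        if op in ("M", "S"):
            i, j = i - 1, j - 1
        elif op == "D":
            i -= 1
        else:
            j -= 1
    return cost[n][m], "".join(reversed(ops))


if __name__ == "__main__":
    src = "he walk home".split()
    tgt = "he walked quickly home".split()
    print(align(src, tgt, uniform_cost))
    print(align(src, tgt, char_similarity_cost))
```

On this pair, the uniform cost ties between two alignments and, with this tie-break, pairs `walk` with `quickly` (`MISM`). The similarity-based cost pairs `walk` with `walked` and treats `quickly` as an insertion (`MSIM`), which is the more plausible edit.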
