SequenceMatcher is slower than python-Levenshtein
I was adapting this code for our private use and noticed that the
python-Levenshtein
package is roughly 100x faster than SequenceMatcher's ratio.
Levenshtein.ratio(A, B) gives the same result. I understand that this library is aimed more at offline benchmarking, but it doesn't hurt to be faster 😉 .
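The claim is easy to check with a small comparison. python-Levenshtein itself is a C extension (the drop-in call is `Levenshtein.ratio(a, b)`); as a hedged illustration, the sketch below reimplements in pure Python the similarity it computes (edit distance with substitution cost 2, normalised by the combined length) next to stdlib `difflib`:

```python
from difflib import SequenceMatcher

def levenshtein_ratio(a: str, b: str) -> float:
    """Pure-Python sketch of the similarity python-Levenshtein's
    ratio() computes: edit distance with substitution cost 2,
    normalised by the combined string length."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))          # row 0: j insertions, cost 1 each
    for i, ca in enumerate(a, 1):
        cur = [i]                           # column 0: i deletions
        for j, cb in enumerate(b, 1):
            sub = 0 if ca == cb else 2      # substitutions cost 2
            cur.append(min(prev[j] + 1,     # delete ca
                           cur[j - 1] + 1,  # insert cb
                           prev[j - 1] + sub))
        prev = cur
    lensum = len(a) + len(b)
    return (lensum - prev[-1]) / lensum

# On many inputs the two measures agree exactly:
print(levenshtein_ratio("kitten", "sitting"))              # ≈ 0.6154
print(SequenceMatcher(None, "kitten", "sitting").ratio())  # ≈ 0.6154
```

The real speedup comes from replacing `SequenceMatcher(None, a, b).ratio()` with the C-level `Levenshtein.ratio(a, b)`; note the two measures can diverge when SequenceMatcher's junk heuristics kick in, so it is worth spot-checking on real data.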
By the way, can you explain the rationale for the custom cost function for substitutions? Do you have an example of how using it changes the alignment path taken?
Issue Analytics
- State:
- Created 5 years ago
- Comments:7 (3 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Thanks for the reference. We noticed the speed improvements you had made at the time (thanks).
BTW, I have been cleaning up the code to make it easily usable as a library: making it pip-installable, adding Python 3 type annotations, etc. If you plan to publish this as a library on PyPI, I will send a pull request. Otherwise I would consider publishing it as a separate library.
On another side note, I want to understand whether homophones can be added as another static rule for substitutions, as they are among the easier classes of errors people make …
P.S. I think we can consider moving the Damerau-Levenshtein code to Cython for speed improvements, but that's a separate task in itself. I realized how slow pure Python is compared to Cython/C after looking at python-Levenshtein.
Aha, nice.
I was actually originally using the Damerau-Levenshtein code bundled with ERRANT (rdlextra.py) to do the character alignment, but found SequenceMatcher to be a lot faster. If python-Levenshtein is faster still, then that might definitely be something worth upgrading.
As for the rationale behind the custom substitution cost, it’s easiest to refer you to the alignment paper, particularly section 3.2 and Table 2.
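For readers without the paper to hand, the general idea can be sketched (this is a hedged illustration, not ERRANT's actual cost function): in a weighted Levenshtein alignment, lowering the substitution cost for similar tokens steers the backtrace toward one-to-one substitutions instead of delete+insert pairs. Here a hypothetical character-similarity cost plays that role:

```python
from difflib import SequenceMatcher

def sub_cost(a: str, b: str) -> float:
    # Hypothetical substitution cost: cheaper for orthographically
    # similar tokens, so "cat" -> "cats" beats a delete+insert pair.
    return 2.0 * (1.0 - SequenceMatcher(None, a, b).ratio())

def align(src, tgt, cost=sub_cost):
    """Weighted Levenshtein alignment returning (distance, op path),
    with ops M (match), S (substitute), D (delete), I (insert)."""
    n, m = len(src), len(tgt)
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    ops = [[""] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0], ops[i][0] = float(i), "D"
    for j in range(1, m + 1):
        dist[0][j], ops[0][j] = float(j), "I"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if src[i - 1] == tgt[j - 1]:
                diag = (dist[i - 1][j - 1], "M")          # match is free
            else:
                diag = (dist[i - 1][j - 1] + cost(src[i - 1], tgt[j - 1]), "S")
            dist[i][j], ops[i][j] = min(diag,
                                        (dist[i - 1][j] + 1.0, "D"),
                                        (dist[i][j - 1] + 1.0, "I"))
    # Backtrace the chosen operations.
    i, j, path = n, m, []
    while i > 0 or j > 0:
        op = ops[i][j]
        path.append(op)
        if op in ("M", "S"):
            i, j = i - 1, j - 1
        elif op == "D":
            i -= 1
        else:
            j -= 1
    return dist[n][m], path[::-1]

# A similarity-aware cost yields clean one-to-one substitutions:
print(align(["the", "cat", "sat"], ["the", "cats", "sit"])[1])  # ['M', 'S', 'S']
```

With a flat substitution cost of 2 (exactly a delete plus an insert), the two paths tie and the alignment can resolve either way; that ambiguity is what a tuned cost removes.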