Improvement: Use Cython for Speed
Having written a LALR parser for my language (https://github.com/eddieschoute/quippy), I find it still takes many seconds to parse a 100k LOC file. As one benchmark, it takes very roughly 2m20s to parse a 600k LOC input file that I have, which is slow in my opinion. One straightforward improvement I can think of is to use Cython to generate a C implementation of the LALR parser. Most of the time seems to be spent in the main LALR parser loop, which Cython could speed up significantly. I would also be open to other suggestions for improving the parsing speed.
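For reference, a quick way to confirm where the time goes using only the standard-library profiler; the grammar and input file names here are illustrative:

```python
import cProfile
import pstats
from lark import Lark

# Hypothetical grammar and input file; substitute your own.
parser = Lark.open("my_grammar.lark", parser="lalr")
with open("big_input.txt") as f:
    source = f.read()

pr = cProfile.Profile()
pr.enable()
parser.parse(source)
pr.disable()

# Print the ten most expensive call sites by cumulative time.
pstats.Stats(pr).sort_stats("cumulative").print_stats(10)
```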
Since the LALR parser specifically is meant to compete on speed, I think it would be worth exploring the possibility of pushing this parser to its limit. Hopefully, converting the code to Cython will be fairly painless, and from there it just remains to optimize the functionality.
I do not know how the standalone parser would be affected by this, but I can imagine that instead of generating `.py` files it should instead generate a `.pyx` file that can be cythonized.
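For what it's worth, a minimal sketch of how such a generated `.pyx` module might be built; `lalr_parser.pyx` is a hypothetical file name, not something Lark produces today:

```python
# setup.py: compile the hypothetical lalr_parser.pyx into a C extension.
from setuptools import setup
from Cython.Build import cythonize

setup(
    name="standalone_parser",
    ext_modules=cythonize(
        "lalr_parser.pyx",
        compiler_directives={"language_level": "3"},
    ),
)
```

Running `python setup.py build_ext --inplace` would then compile the module in place.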
Issue Analytics
- Created: 5 years ago
- Reactions: 2
- Comments: 29 (24 by maintainers)
Top GitHub Comments
I’ve put together a repo with the experiments at https://github.com/drslump/lark-lexer-experiments
It compares the current lexer implementation, which builds a single regexp from the different token expressions, against an re2c-based finite-state-machine bytecode generator and against a C extension.
The finite-state-machine approach is surprisingly good, providing a ~50% speedup over the regex one. The C extension is obviously faster, with a 300% speedup, and it can go up to 500% if it consumes all the tokens and returns a list when it's done.
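For context, the regex-union strategy being compared against works roughly like this (a minimal sketch; the token set is illustrative, not Lark's actual terminal handling):

```python
import re

# Illustrative token definitions; Lark builds a comparable union
# from the grammar's terminals.
TOKENS = [
    ("NUMBER", r"\d+"),
    ("NAME", r"[A-Za-z_]\w*"),
    ("OP", r"[+\-*/=()]"),
    ("WS", r"\s+"),
]

# One master pattern: each token becomes a named alternative.
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKENS))

def tokenize(text):
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "WS":  # skip whitespace
            yield m.lastgroup, m.group()
        pos = m.end()
```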
Overall I think that relying on `re2c` to generate a high-performance lexer is a good approach; it's an extra dependency, but it produces very fast lexers. The FSM approach requires a preprocessing step, but the result is platform independent, so it can be shipped with the code without requiring any compilation. Of course, for the highest performance the C extension is the way to go.
The only big piece missing, besides handling Unicode correctly, would be to implement the transform from the Perl-flavoured regexes used in Lark grammars to the format understood by `re2c`. Not all regexes could be transformed, though.
By the way, if someone could try it on PyPy I'd appreciate it; I can't seem to install it on my laptop for some reason. I'm eager to see the results: if the FSM function is eligible for its JIT (it's kind of big, so not sure if they skip based on size), the performance should be really close to the C extension.
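To make the FSM idea concrete, here is a toy sketch in which a hand-written transition table stands in for the re2c-derived bytecode; a real table would be produced by the preprocessing step:

```python
# Toy table-driven lexer: only the plain-Python interpreter below has to
# ship with the package, since the table itself is generated ahead of time.
# This hand-written table recognises just INT and NAME and is purely
# illustrative.
ACCEPT = {1: "INT", 2: "NAME"}

def char_class(c):
    if c.isdigit():
        return "digit"
    if c.isalpha() or c == "_":
        return "alpha"
    return "other"

# state -> {character class -> next state}
TABLE = {
    0: {"digit": 1, "alpha": 2},
    1: {"digit": 1},
    2: {"alpha": 2, "digit": 2},
}

def next_token(text, pos):
    state, last_accept, last_end = 0, None, pos
    i = pos
    while i < len(text):
        nxt = TABLE.get(state, {}).get(char_class(text[i]))
        if nxt is None:
            break  # no transition: emit the longest match seen so far
        state, i = nxt, i + 1
        if state in ACCEPT:
            last_accept, last_end = ACCEPT[state], i
    if last_accept is None:
        raise SyntaxError(f"no token at position {pos}")
    return last_accept, text[pos:last_end], last_end
```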
This is not a good idea. Something this parsing library offers which nothing else has is the ability to create ridiculously portable parsers. The `zipimport` system used by Python allows you to make single-file executable parsers (as in a Python executable that consists of only one .zip file) that will run on literally any platform with a Python interpreter, batteries included, without requiring any compilation, configuration or setup. The `zipimport` system also makes the standalone parser superfluous, as you can package Lark directly into the executable and the main script can use `pkgutil` to pull in, say, a grammar file from within the archive.
If you need speed then use PyPy; in my experience you can get a 5x speedup with it for free when using Lark. If you're in a situation where slow compilation is costing you money and portability is not a concern, then I would consider switching to a compiler-compiler like Bison or Yacc.
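A minimal sketch of that workflow; the archive layout and the `grammars/expr.lark` resource are illustrative, but `pkgutil.get_data` does work from inside a zip archive:

```python
# main.py at the root of the archive (e.g. myapp.pyz), packaged next to
# the lark package and a grammars/ package containing expr.lark.
# grammars/ needs an __init__.py so it is importable as a package.
import pkgutil
from lark import Lark

# pkgutil.get_data resolves a resource relative to a package, whether
# the package lives on disk or inside a zipimport-ed archive.
grammar = pkgutil.get_data("grammars", "expr.lark").decode("utf-8")
parser = Lark(grammar, parser="lalr")

print(parser.parse("1 + 2 * 3").pretty())
```

The archive itself can be built with the standard-library `zipapp` module, e.g. `python -m zipapp myapp_dir -o myapp.pyz`.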
If you want to add optional CPython/Cython extensions that are preferred over their pure-Python counterparts, that's a happy medium; however, additional care will need to be taken to ensure the compiled extensions behave exactly the same as their pure-Python versions. This means writing unit tests against the module APIs to make sure functions and classes stay in sync, as sketched below. If you want to retain support for PyPy alongside the compiled extensions, that's yet another layer of new work.
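A sketch of what such parity tests might look like, with hypothetical `pure_lexer` and `compiled_lexer` modules standing in for the two implementations:

```python
import pytest

import pure_lexer  # the pure-Python reference implementation
# Skip this whole test module when the optional extension isn't built.
compiled_lexer = pytest.importorskip("compiled_lexer")

CASES = ["1 + 2", "foo(bar, 42)", "", "   \n\t"]

@pytest.mark.parametrize("source", CASES)
def test_token_streams_agree(source):
    # Both implementations must yield identical token streams,
    # including on edge cases such as empty input.
    assert list(pure_lexer.tokenize(source)) == list(compiled_lexer.tokenize(source))
```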
The group of contributors for this project is small and loosely organized, and duplicating the functionality of existing modules in compiled extensions will double the project's workload. Maintaining the extensions in a manner that doesn't make life more interesting will slow down future bug fixes and improvements, as changes will have to be mirrored between two functionally identical, yet syntactically different, codebases. Odds are the extensions will end up only partially complete, or modules will stop being maintained.
Or you could sacrifice `zipimport` and PyPy support for a modest speed improvement, a fatal blow to portability, and a more complicated PyPI upload process. I say KIS.