Improvement: Use Cython for Speed
Having written a LALR parser for my language (https://github.com/eddieschoute/quippy), I find it still takes many seconds to parse a 100k LOC file. As one benchmark, it takes very roughly 2m20s to parse a 600k LOC input file that I have, which is slow in my opinion. One straightforward improvement I can think of is to use Cython to generate a C implementation of the LALR parser. Most of the time seems to be spent in the main LALR parser loop, which Cython could speed up significantly. I would also be open to other suggestions for improving the parsing speed.
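For reference, a quick way to confirm where the time goes using only the standard-library profiler; the grammar and input file names here are illustrative:

```python
import cProfile
import pstats
from lark import Lark

# Hypothetical grammar and input file; substitute your own.
parser = Lark.open("my_grammar.lark", parser="lalr")
with open("big_input.txt") as f:
    source = f.read()

pr = cProfile.Profile()
pr.enable()
parser.parse(source)
pr.disable()

# Print the ten most expensive call sites by cumulative time.
pstats.Stats(pr).sort_stats("cumulative").print_stats(10)
```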
Since the LALR parser specifically is meant to compete on speed, I think it would be worth exploring the possibility of pushing this parser to its limit. Hopefully, converting the code to Cython will be fairly painless, and from there it just remains to optimize the functionality.
I do not know how the standalone parser would be affected by this, but I can imagine that instead of generating `.py` files it should instead generate a `.pyx` file that can be cythonized.
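For what it's worth, a minimal sketch of how such a generated `.pyx` module might be built; `lalr_parser.pyx` is a hypothetical file name, not something Lark produces today:

```python
# setup.py: compile the hypothetical lalr_parser.pyx into a C extension.
from setuptools import setup
from Cython.Build import cythonize

setup(
    name="standalone_parser",
    ext_modules=cythonize(
        "lalr_parser.pyx",
        compiler_directives={"language_level": "3"},
    ),
)
```

Running `python setup.py build_ext --inplace` would then compile the module in place.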
Issue Analytics
- Created: 5 years ago
- Reactions: 2
- Comments: 29 (24 by maintainers)
Top GitHub Comments
I’ve put together a repo with the experiments at https://github.com/drslump/lark-lexer-experiments
It compares the current lexer implementation, which builds a single regexp from the different token expressions, against an re2c-based finite-state-machine bytecode generator and against a C extension.
The finite-state-machine approach is surprisingly good, providing a ~50% speedup over the regex one. The C extension is obviously faster, with a 300% speedup, and it can go up to 500% if it consumes all the tokens and returns a list when it's done.
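For context, the regex-union strategy being compared against works roughly like this (a minimal sketch; the token set is illustrative, not Lark's actual terminal handling):

```python
import re

# Illustrative token definitions; Lark builds a comparable union
# from the grammar's terminals.
TOKENS = [
    ("NUMBER", r"\d+"),
    ("NAME", r"[A-Za-z_]\w*"),
    ("OP", r"[+\-*/=()]"),
    ("WS", r"\s+"),
]

# One master pattern: each token becomes a named alternative.
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKENS))

def tokenize(text):
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "WS":  # skip whitespace
            yield m.lastgroup, m.group()
        pos = m.end()
```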
Overall I think that relying on `re2c` to generate a high-performance lexer is a good approach; it's an extra dependency, but it produces very fast lexers. The FSM approach requires a preprocessing step, but the result is platform independent, so it can be shipped with the code without requiring any compilation. Of course, for the highest performance the C extension is the way to go.
The only big piece missing, besides handling Unicode correctly, would be to implement the transform from the Perl-flavoured regexes used in Lark grammars to the format understood by `re2c`. Not all regexes could be transformed, though.
By the way, if someone could try it on PyPy I'd appreciate it; I can't seem to install it on my laptop for some reason. I'm eager to see the results: if the FSM function is eligible for its JIT (it's kind of big, so not sure if they skip based on size), the performance should be really close to the C extension.
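To make the FSM idea concrete, here is a toy sketch in which a hand-written transition table stands in for the re2c-derived bytecode; a real table would be produced by the preprocessing step:

```python
# Toy table-driven lexer: only the plain-Python interpreter below has to
# ship with the package, since the table itself is generated ahead of time.
# This hand-written table recognises just INT and NAME and is purely
# illustrative.
ACCEPT = {1: "INT", 2: "NAME"}

def char_class(c):
    if c.isdigit():
        return "digit"
    if c.isalpha() or c == "_":
        return "alpha"
    return "other"

# state -> {character class -> next state}
TABLE = {
    0: {"digit": 1, "alpha": 2},
    1: {"digit": 1},
    2: {"alpha": 2, "digit": 2},
}

def next_token(text, pos):
    state, last_accept, last_end = 0, None, pos
    i = pos
    while i < len(text):
        nxt = TABLE.get(state, {}).get(char_class(text[i]))
        if nxt is None:
            break  # no transition: emit the longest match seen so far
        state, i = nxt, i + 1
        if state in ACCEPT:
            last_accept, last_end = ACCEPT[state], i
    if last_accept is None:
        raise SyntaxError(f"no token at position {pos}")
    return last_accept, text[pos:last_end], last_end
```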
This is not a good idea. Something this parsing library offers which nothing else has is the ability to create ridiculously portable parsers. The `zipimport` system used by Python allows you to make single-file executable parsers (as in a Python executable that consists of only one .zip file) that will run on literally any platform with a Python interpreter, batteries included, without requiring any compilation, configuration or setup. The `zipimport` system also makes the standalone parser superfluous, as you can package Lark directly into the executable and the main script can use `pkgutil` to pull in, say, a grammar file from within the archive.
If you need speed then use PyPy; in my experience you can get a 5x speedup with it for free when using Lark. If you're in a situation where slow compilation is costing you money and portability is not a concern, then I would consider switching to a compiler-compiler like Bison or Yacc.
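A minimal sketch of that workflow; the archive layout and the `grammars/expr.lark` resource are illustrative, but `pkgutil.get_data` does work from inside a zip archive:

```python
# main.py at the root of the archive (e.g. myapp.pyz), packaged next to
# the lark package and a grammars/ package containing expr.lark.
# grammars/ needs an __init__.py so it is importable as a package.
import pkgutil
from lark import Lark

# pkgutil.get_data resolves a resource relative to a package, whether
# the package lives on disk or inside a zipimport-ed archive.
grammar = pkgutil.get_data("grammars", "expr.lark").decode("utf-8")
parser = Lark(grammar, parser="lalr")

print(parser.parse("1 + 2 * 3").pretty())
```

The archive itself can be built with the standard-library `zipapp` module, e.g. `python -m zipapp myapp_dir -o myapp.pyz`.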
If you want to add optional CPython/Cython extensions that are preferred over their pure-Python counterparts, that's a happy medium; however, additional care will need to be taken to ensure the compiled extensions behave exactly the same as their pure-Python versions. This means writing unit tests against the module APIs to make sure functions and classes stay in sync, as sketched below. If you want to retain support for PyPy alongside the compiled extensions, that's yet another layer of new work.
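A sketch of what such parity tests might look like, with hypothetical `pure_lexer` and `compiled_lexer` modules standing in for the two implementations:

```python
import pytest

import pure_lexer  # the pure-Python reference implementation
# Skip this whole test module when the optional extension isn't built.
compiled_lexer = pytest.importorskip("compiled_lexer")

CASES = ["1 + 2", "foo(bar, 42)", "", "   \n\t"]

@pytest.mark.parametrize("source", CASES)
def test_token_streams_agree(source):
    # Both implementations must yield identical token streams,
    # including on edge cases such as empty input.
    assert list(pure_lexer.tokenize(source)) == list(compiled_lexer.tokenize(source))
```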
The group of contributors for this project is small and loosely organized, and duplicating the functionality of existing modules in compiled extensions will double the project's workload. Maintaining the extensions in a manner that doesn't make life more interesting will slow down future bug fixes and improvements, as changes will have to be mirrored between two functionally identical, yet syntactically different, codebases. Odds are the extensions will end up only partially complete, or modules will stop being maintained.
Or you could sacrifice `zipimport` and PyPy support for a modest speed improvement, a fatal blow to portability, and a more complicated PyPI upload process. I say KIS.