Lark parse performance and parent / child indentation
What is your question?
I have a text parser (ciscoconfparse) which I use to parse various configurations. ciscoconfparse assigns parent / child relationships by indentation: a line becomes a child of a preceding line when it is indented more than that line. See Example01 below.
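For illustration, the indentation rule described above can be sketched in plain Python. This is a minimal sketch and not ciscoconfparse's actual implementation: the function name `assign_parents` is hypothetical, only leading spaces count as indentation, and comment lines are not treated specially.

```python
from typing import List, Optional, Tuple


def assign_parents(lines: List[str]) -> List[Tuple[int, Optional[int]]]:
    """Return (indent, parent_index) for every line.

    A line's parent is the nearest previous line with a strictly
    smaller indent. Blank lines get indent 0 and no parent.
    """
    result: List[Tuple[int, Optional[int]]] = []
    stack: List[Tuple[int, int]] = []  # (indent, line_index) of open ancestors
    for idx, line in enumerate(lines):
        stripped = line.lstrip(" ")
        indent = len(line) - len(stripped)
        if not stripped:
            result.append((0, None))
            continue
        # Drop ancestors indented as much as, or more than, this line.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        parent = stack[-1][1] if stack else None
        result.append((indent, parent))
        stack.append((indent, idx))
    return result


config = [
    "vehicle for Jim",
    " 600 horsepower",
    " 2 doors",
    "  chrome handles outside",
    "vehicle for Charles",
    " 4 wheels",
]
for idx, (indent, parent) in enumerate(assign_parents(config)):
    print(idx, indent, parent, repr(config[idx]))
```

The stack holds the chain of currently open ancestors, so each line is handled in amortized constant time.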
All code in this GitHub issue was run under Debian 10 and Python 3.7:
- Lark version 0.11.1
- ciscoconfparse version 1.5.22
If you’re having trouble with your code or grammar
I am interested in converting my script to use Lark, but I can’t stomach the Lark performance hit that I see in Example01: in the run shown below, ciscoconfparse parses the same text roughly 25x faster than Lark (about 0.0015 s versus 0.037 s).
- Is there a way for me to accelerate Lark’s parse time in Example01?
- So far, I can’t find a way to extract indent_level for the Lark-parsed config. Can someone give me assistance with how I can quantify the indent level of each Lark-parsed line in Example01?
As recommended in Lark’s docs, the NL token also matches the leading whitespace of the following line. I realize I could carry indent_level over from the previous line’s NL token, but I hope for a simpler option than that.
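The carry-over approach mentioned above can be sketched as follows. This is a hedged illustration rather than a Lark API: plain `(type, value)` tuples stand in for Lark tokens so the block runs without Lark installed, and the names `indent_after_newline` and `indent_levels` are hypothetical. Lark's `Token` objects expose a `type` attribute and behave like strings, so the same loop should carry over, though that is untested here.

```python
from typing import Iterable, List, Tuple


def indent_after_newline(nl_text: str) -> int:
    """Count the whitespace characters that follow the last newline in an NL match."""
    tail = nl_text.rsplit("\n", 1)[-1]
    return len(tail)


def indent_levels(tokens: Iterable[Tuple[str, str]]) -> List[Tuple[int, str]]:
    """Pair each non-NL token with the indent carried from the preceding NL."""
    levels = []
    current = 0  # assumption: a first line with no preceding NL has indent 0
    for tok_type, value in tokens:
        if tok_type == "NL":
            current = indent_after_newline(value)
        else:
            levels.append((current, value))
    return levels


# Stand-in token stream shaped like the Example01 output (types and values only).
tokens = [
    ("NL", "\n"),
    ("INSTRUCTION", "vehicle for Jim"),
    ("NL", "\n "),
    ("INSTRUCTION", "600 horsepower"),
    ("NL", "\n "),
    ("INSTRUCTION", "2 doors"),
    ("NL", "\n  "),
    ("INSTRUCTION", "chrome handles outside"),
    ("NL", "\n"),
]
for indent, text in indent_levels(tokens):
    print(indent, text)
```

Counting characters after the last newline matches the tab_len = 1 setting in Example01; a different tab width would need its own conversion.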
Explain what you’re trying to do, and what is obstructing your progress.
# Example01
# File: lark_test_indent.py
import time
from lark import Lark
from lark.indenter import Indenter
from ciscoconfparse import CiscoConfParse
from snoop import Config
config = Config(color=True)
snoop = config.snoop
class MyTreeIndenter(Indenter):
    tab_len = 1
    NL_type = 'NL'
    OPEN_PAREN_types = []
    CLOSE_PAREN_types = []
    INDENT_type = '_INDENT'
    DEDENT_type = '_DEDENT'
grammar = r"""
start : (empty_line
      | instruction
      | comment)+
comment : (_INDENT | _DEDENT)* COMMENT NL
instruction : (_INDENT | _DEDENT)* INSTRUCTION NL
empty_line : NL+
// Tokens are defined with uppercase names...
COMMENT : /\![^\n]*/
INSTRUCTION : /\w[^\n]*/
// NL *must* match leading whitespace on new lines...
NL : /(\r?\n)+[\s\t]*/
%import common.WS_INLINE
%ignore WS_INLINE
%declare _INDENT _DEDENT
"""
text_01 = """
vehicle for Jim
 600 horsepower
 18 wheels
 red color
 2 doors
  chrome handles outside
  no pushbutton lock outside
"""
text_02 = """
vehicle for Jim
 600 horsepower
 18 wheels
 red color
 2 doors
  chrome handles outside
  pushbutton lock outside
vehicle for Charles
 320 horsepower
 4 wheels
 green color
 4 doors
 !
  chrome handles outside
  !no pushbutton lock outside
!
"""
#@snoop(depth=2)
def lark_parse(instructions=""):
    parser = Lark(grammar, parser='lalr', postlex=MyTreeIndenter())
    parse_tree = parser.parse(instructions)
    for inst_obj in parse_tree.children:
        print("INSTRUCTION", str(inst_obj))

def ccp(instructions=""):
    parse = CiscoConfParse(instructions.splitlines())
    for obj in parse.ConfigObjs:
        print(obj, "Comment:", obj.is_comment)
if __name__=='__main__':
    begin = time.time()
    lark_parse(text_02)
    end = round(time.time() - begin, 6)
    print("Lark parse time: %s" % end)
    print("---------------------------")
    begin = time.time()
    ccp(text_02)
    end = round(time.time() - begin, 6)
    print("CiscoConfParse parse time: %s" % end)
Example01 parse output:
$ python lark_test_indent.py
INSTRUCTION Tree('empty_line', [Token('NL', '\n')])
INSTRUCTION Tree('instruction', [Token('INSTRUCTION', 'vehicle for Jim'), Token('NL', '\n ')])
INSTRUCTION Tree('instruction', [Token('INSTRUCTION', '600 horsepower'), Token('NL', '\n ')])
INSTRUCTION Tree('instruction', [Token('INSTRUCTION', '18 wheels'), Token('NL', '\n ')])
INSTRUCTION Tree('instruction', [Token('INSTRUCTION', 'red color'), Token('NL', '\n ')])
INSTRUCTION Tree('instruction', [Token('INSTRUCTION', '2 doors'), Token('NL', '\n ')])
INSTRUCTION Tree('instruction', [Token('INSTRUCTION', 'chrome handles outside'), Token('NL', '\n ')])
INSTRUCTION Tree('instruction', [Token('INSTRUCTION', 'pushbutton lock outside'), Token('NL', '\n')])
INSTRUCTION Tree('instruction', [Token('INSTRUCTION', 'vehicle for Charles'), Token('NL', '\n ')])
INSTRUCTION Tree('instruction', [Token('INSTRUCTION', '320 horsepower'), Token('NL', '\n ')])
INSTRUCTION Tree('instruction', [Token('INSTRUCTION', '4 wheels'), Token('NL', '\n ')])
INSTRUCTION Tree('instruction', [Token('INSTRUCTION', 'green color'), Token('NL', '\n ')])
INSTRUCTION Tree('instruction', [Token('INSTRUCTION', '4 doors'), Token('NL', '\n ')])
INSTRUCTION Tree('comment', [Token('COMMENT', '!'), Token('NL', '\n ')])
INSTRUCTION Tree('instruction', [Token('INSTRUCTION', 'chrome handles outside'), Token('NL', '\n ')])
INSTRUCTION Tree('comment', [Token('COMMENT', '!no pushbutton lock outside'), Token('NL', '\n')])
INSTRUCTION Tree('comment', [Token('COMMENT', '!'), Token('NL', '\n')])
Lark parse time: 0.037327
---------------------------
<IOSCfgLine # 0 'vehicle for Jim'> Comment: False
<IOSCfgLine # 1 ' 600 horsepower' (parent is # 0)> Comment: False
<IOSCfgLine # 2 ' 18 wheels' (parent is # 0)> Comment: False
<IOSCfgLine # 3 ' red color' (parent is # 0)> Comment: False
<IOSCfgLine # 4 ' 2 doors' (parent is # 0)> Comment: False
<IOSCfgLine # 5 ' chrome handles outside' (parent is # 4)> Comment: False
<IOSCfgLine # 6 ' pushbutton lock outside' (parent is # 4)> Comment: False
<IOSCfgLine # 7 'vehicle for Charles'> Comment: False
<IOSCfgLine # 8 ' 320 horsepower' (parent is # 7)> Comment: False
<IOSCfgLine # 9 ' 4 wheels' (parent is # 7)> Comment: False
<IOSCfgLine # 10 ' green color' (parent is # 7)> Comment: False
<IOSCfgLine # 11 ' 4 doors' (parent is # 7)> Comment: False
<IOSCfgLine # 12 ' !' (parent is # 7)> Comment: True
<IOSCfgLine # 13 ' chrome handles outside' (parent is # 11)> Comment: False
<IOSCfgLine # 14 ' !no pushbutton lock outside' (parent is # 11)> Comment: True
<IOSCfgLine # 15 '!'> Comment: True
CiscoConfParse parse time: 0.001478
$
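A note on the timings above: the Lark measurement in Example01 covers constructing the Lark object, parsing, and printing, all in a single run. One way to see where the time goes is to time setup and repeated calls separately. The helper below is only a sketch: `time_setup_and_calls` and `make_stand_in_parser` are hypothetical names, and the stand-in parser is `str.splitlines` so the block runs without Lark installed. Whether parser construction actually dominates the Lark figure is an assumption to check, not a measured result.

```python
import time
from typing import Any, Callable, Tuple


def time_setup_and_calls(
    setup: Callable[[], Callable[[str], Any]],
    text: str,
    repeat: int = 100,
) -> Tuple[float, float]:
    """Return (setup_seconds, mean_seconds_per_call)."""
    start = time.perf_counter()
    parse = setup()  # e.g. build the parser once
    setup_seconds = time.perf_counter() - start

    start = time.perf_counter()
    for _ in range(repeat):
        parse(text)
    per_call = (time.perf_counter() - start) / repeat
    return setup_seconds, per_call


def make_stand_in_parser() -> Callable[[str], Any]:
    # Hypothetical stand-in so the sketch runs anywhere; it does no real parsing.
    return str.splitlines


setup_s, call_s = time_setup_and_calls(make_stand_in_parser, "a\n b\n", repeat=1000)
print("setup: %.6f s, per parse: %.6f s" % (setup_s, call_s))
```

With Lark, `setup` could return `Lark(grammar, parser='lalr', postlex=MyTreeIndenter()).parse`, so that grammar compilation is counted once rather than on every parse.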
Top GitHub Comments
That depends on what your goal is. Your current grammar is probably not finished. You should first make the grammar work correctly and then make it faster.
I answered your question, which said nothing about parsing with Lark.
At this point, I do not have a complete Lark parser. Given the performance gap between Lark and ciscoconfparse’s original parser, I’m still weighing the costs and benefits of parsing with Lark.