Earley and LALR parser do not agree on results
I’m not an expert on parsers, so this may not be a bug, but it can’t hurt to file an issue…
I’m trying to create a parser for specially-formatted 68k assembly (subset):
m68kdis_parser = lark.Lark(r"""
program : line+
line : hexint WS_INLINE hexint WS_INLINE (label WS_INLINE)? insn WS_INLINE? NEWLINE
insn: two_op_insn
insn_name: UCASE_LETTER+
two_op_insn: insn_name WS_INLINE addr_offset "," areg
addr_offset: INT "(" areg ")"
label: "L" /[0-9]/+
areg: "A"/[0-7]/
hexint : HEXDIGIT+
%import common.WS_INLINE
%import common.CNAME
%import common.HEXDIGIT
%import common.UCASE_LETTER
%import common.NEWLINE
%import common.INT
%import common.WS
%ignore WS""", start="program", parser="lalr")
The following test string succeeds with the Earley parser, but fails with LALR:
00000940 47ed0080 LEA 128(A5),A3
---------------------------------------------------------------------------
UnexpectedCharacters Traceback (most recent call last)
<ipython-input-336-f9ec4572c60d> in <module>()
----> 1 m68kdis_parser.parse("00000940 47ed0080 LEA 128(A5),A3")
C:\msys64\mingw64\lib\python3.6\site-packages\lark\lark.py in parse(self, text)
221 def parse(self, text):
222 "Parse the given text, according to the options provided. Returns a tree, unless specified otherwise."
--> 223 return self.parser.parse(text)
224
225 # if self.profiler:
C:\msys64\mingw64\lib\python3.6\site-packages\lark\parser_frontends.py in parse(self, text)
36 token_stream = self.lex(text)
37 sps = self.lexer.set_parser_state
---> 38 return self.parser.parse(token_stream, *[sps] if sps is not NotImplemented else [])
39
40 class LALR_TraditionalLexer(WithLexer):
C:\msys64\mingw64\lib\python3.6\site-packages\lark\parsers\lalr_parser.py in parse(self, seq, set_state)
66
67 # Main LALR-parser loop
---> 68 for i, token in enumerate(stream):
69 while True:
70 action, arg = get_action(token.type)
C:\msys64\mingw64\lib\python3.6\site-packages\lark\lexer.py in lex(self, stream)
257 def lex(self, stream):
258 l = _Lex(self.lexers[self.parser_state], self.parser_state)
--> 259 for x in l.lex(stream, self.root_lexer.newline_types, self.root_lexer.ignore_types):
260 yield x
261 l.lexer = self.lexers[self.parser_state]
C:\msys64\mingw64\lib\python3.6\site-packages\lark\lexer.py in lex(self, stream, newline_types, ignore_types)
103 else:
104 if line_ctr.char_pos < len(stream):
--> 105 raise UnexpectedCharacters(stream, line_ctr.char_pos, line_ctr.line, line_ctr.column, state=self.state)
106 break
107
UnexpectedCharacters: No terminal defined for 'E' at line 1 col 25
00000940 47ed0080 LEA 128(A5),A3
^
The problem is that “LEA” is being misinterpreted as the start of a “label”. But since labels are optional, shouldn’t the parser also try to match an “insn”? Or am I misunderstanding something fundamental?
Lark version is 0.6.4
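To make the suspected failure mode concrete, here is a toy first-match tokenizer written with only Python's `re` module. It is a sketch, not Lark's implementation: the terminal tables, the `tokenize` helper, and the `LABEL` / `INSN_NAME` terminal names are all hypothetical, and it assumes that a string literal such as "L" is tried before a character class such as UCASE_LETTER. Under that assumption, the character-level terminals split "LEA" so that the first token is the literal "L", while whole-token terminals keep "LEA" in one piece.

```python
import re

# Toy first-match lexer. This is NOT Lark's implementation; the terminal
# tables and names below are hypothetical and only mirror the grammar's shape.

# Assumption: a string literal such as "L" is tried before a character class.
FRAGMENT_TERMINALS = [
    ("L", re.compile(r"L")),                 # from: label: "L" /[0-9]/+
    ("DIGIT", re.compile(r"[0-9]")),
    ("UCASE_LETTER", re.compile(r"[A-Z]")),  # from: insn_name: UCASE_LETTER+
]

# Alternative: define labels and instruction names as whole tokens.
WHOLE_TOKEN_TERMINALS = [
    ("LABEL", re.compile(r"L[0-9]+")),
    ("INSN_NAME", re.compile(r"[A-Z]+")),
]


def tokenize(text, terminals):
    """Return (name, value) pairs; the first terminal that matches wins."""
    tokens = []
    pos = 0
    while pos < len(text):
        for name, pattern in terminals:
            match = pattern.match(text, pos)
            if match:
                tokens.append((name, match.group()))
                pos = match.end()
                break
        else:
            raise ValueError(f"No terminal matches {text[pos]!r} at position {pos}")
    return tokens


fragment_tokens = tokenize("LEA", FRAGMENT_TERMINALS)
whole_tokens = tokenize("LEA", WHOLE_TOKEN_TERMINALS)
label_tokens = tokenize("L12", WHOLE_TOKEN_TERMINALS)
```

With the character-level table, a parser receives the token "L" first. The only rule in the grammar that starts with that token is label, so it then expects a digit and has nothing to do with 'E'. With whole-token terminals, "LEA" can only be an instruction name and "L12" can only be a label.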
Top GitHub Comments
I use this exact case in Earley, so the current Earley parser will handle it. I don’t know about LALR. Just wanted to second the use case 😃 In my case I want to basically ignore all whitespace - but I have specific places in the files where I need to ensure there is specific whitespace.
This still ignores extraneous whitespace before and after TOKEN, without having to encode it explicitly, but ensures there is a TAB or SPACE between the two tokens. The more useful case here is _EOL, which ensures that my lines end with an _EOL (this is semantically important) but otherwise ignores empty lines and extra _EOLs, without having to end every line with _EOL+ (which is what I used to do).
There are many ways to solve this. You should think about what sort of errors you want to catch. If someone does an extra space, what happens? If someone forgets a number, what happens? Etc.
You can be very strict and do
Or you can make it more relaxed. It’s up to you and the interface you want to provide.