
Earley and LALR parser do not agree on results

I’m not an expert on parsers, so this may not be a bug, but can’t hurt to file an issue…

I’m trying to create a parser for specially-formatted 68k assembly (subset):

import lark

m68kdis_parser = lark.Lark(r"""
    program : line+
    line : hexint WS_INLINE hexint WS_INLINE (label WS_INLINE)? insn WS_INLINE? NEWLINE
    insn: two_op_insn

    insn_name: UCASE_LETTER+
    two_op_insn: insn_name WS_INLINE addr_offset "," areg

    addr_offset: INT "(" areg ")"
    label: "L" /[0-9]/+

    areg: "A"/[0-7]/

    hexint : HEXDIGIT+

    %import common.WS_INLINE
    %import common.CNAME
    %import common.HEXDIGIT
    %import common.UCASE_LETTER
    %import common.NEWLINE
    %import common.INT
    %import common.WS
    %ignore WS""", start="program", parser="lalr")

The following test string succeeds with the Earley parser, but fails with LALR:

00000940   47ed0080				LEA	128(A5),A3

---------------------------------------------------------------------------
UnexpectedCharacters                      Traceback (most recent call last)
<ipython-input-336-f9ec4572c60d> in <module>()
----> 1 m68kdis_parser.parse("00000940   47ed0080                               LEA     128(A5),A3")

C:\msys64\mingw64\lib\python3.6\site-packages\lark\lark.py in parse(self, text)
    221     def parse(self, text):
    222         "Parse the given text, according to the options provided. Returns a tree, unless specified otherwise."
--> 223         return self.parser.parse(text)
    224 
    225         # if self.profiler:

C:\msys64\mingw64\lib\python3.6\site-packages\lark\parser_frontends.py in parse(self, text)
     36         token_stream = self.lex(text)
     37         sps = self.lexer.set_parser_state
---> 38         return self.parser.parse(token_stream, *[sps] if sps is not NotImplemented else [])
     39 
     40 class LALR_TraditionalLexer(WithLexer):

C:\msys64\mingw64\lib\python3.6\site-packages\lark\parsers\lalr_parser.py in parse(self, seq, set_state)
     66 
     67         # Main LALR-parser loop
---> 68         for i, token in enumerate(stream):
     69             while True:
     70                 action, arg = get_action(token.type)

C:\msys64\mingw64\lib\python3.6\site-packages\lark\lexer.py in lex(self, stream)
    257     def lex(self, stream):
    258         l = _Lex(self.lexers[self.parser_state], self.parser_state)
--> 259         for x in l.lex(stream, self.root_lexer.newline_types, self.root_lexer.ignore_types):
    260             yield x
    261             l.lexer = self.lexers[self.parser_state]

C:\msys64\mingw64\lib\python3.6\site-packages\lark\lexer.py in lex(self, stream, newline_types, ignore_types)
    103             else:
    104                 if line_ctr.char_pos < len(stream):
--> 105                     raise UnexpectedCharacters(stream, line_ctr.char_pos, line_ctr.line, line_ctr.column, state=self.state)
    106                 break
    107 

UnexpectedCharacters: No terminal defined for 'E' at line 1 col 25

00000940   47ed0080				LEA	128(A5),A3
                        ^

The problem is that “LEA” is being misinterpreted as a “label”, but since “labels” are optional, shouldn’t the parser be trying to match to an “insn” token? Or am I misunderstanding something fundamental?

Lark version is 0.6.4
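
One way to picture the behaviour asked about above is a lexer that commits to each token before the parser has seen what follows. The toy below is not Lark's implementation; it only assumes that the literal "L" is tried before a UCASE_LETTER-style pattern and that the choice is never revisited. The names `TERMINALS`, `ALLOWED_AFTER`, and `toy_lex` are hypothetical.

```python
import re

# Toy model only, NOT Lark's code. Assumption: the literal "L" is tried
# before the UCASE_LETTER-style pattern, and the first match is final.
TERMINALS = [
    ("L", re.compile(r"L")),                 # anonymous "L" from the label rule
    ("UCASE_LETTER", re.compile(r"[A-Z]")),  # letters of an instruction name
    ("DIGIT", re.compile(r"[0-9]")),         # digits of a label
]

# Which terminals this toy accepts after the previous one.
ALLOWED_AFTER = {
    None: ["L", "UCASE_LETTER"],        # a label or an instruction may start
    "L": ["DIGIT"],                     # inside a label: only digits
    "DIGIT": ["DIGIT"],
    "UCASE_LETTER": ["UCASE_LETTER"],
}


def toy_lex(text):
    """Tokenize one character at a time, never backtracking."""
    tokens, prev = [], None
    for pos, ch in enumerate(text):
        for name, pattern in TERMINALS:
            if name in ALLOWED_AFTER[prev] and pattern.fullmatch(ch):
                tokens.append((name, ch))
                prev = name
                break
        else:
            # No allowed terminal matches: same shape as the error above.
            raise ValueError(f"No terminal defined for {ch!r} at position {pos}")
    return tokens
```

In this toy, `toy_lex("L12")` and `toy_lex("ADD")` succeed, while `toy_lex("LEA")` fails on the "E": the leading "L" has already been taken as the start of a label, and nothing goes back to try it as the first letter of an instruction name.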

Issue Analytics

State:
Created 5 years ago
Comments:14 (7 by maintainers)

Top GitHub Comments

night199uk commented, Dec 5, 2018

I use this exact case in Earley, so the current Earley parser will handle it. I don’t know about LALR. Just wanted to second the use case 😃 In my case I want to basically ignore all whitespace - but I have specific places in the files where I need to ensure there is specific whitespace.

TABSPACE: TAB|SPACE
_TS: TABSPACE
_MTS:   TABSPACE  // Mandatory Tab/Space
_EOL: CR | LF | ( CR LF )

rule: TOKEN _MTS TOKEN _EOL

%ignore _TS
%ignore _EOL

This still ignores extraneous whitespace before and after TOKEN without having to encode it explicitly, but ensures there is a TAB or SPACE between the two tokens. The more useful case here is _EOL: it ensures my lines end with an _EOL, which is semantically important, but otherwise ignores empty lines and extra _EOLs without having to end every line with _EOL+ (which is what I used to do).
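
The behaviour described above (a mandatory separator and a mandatory line ending, with all other whitespace ignored) can be sketched with the standard `re` module. This is only a rough illustration, not a replacement for the grammar: TOKEN is not defined in the comment, so `\w+` is assumed here, and `RULE`, `SKIP`, and `parse_rules` are hypothetical names.

```python
import re

# Assumption: TOKEN is not defined in the comment, so \w+ stands in for it.
# One rule: TOKEN, at least one tab/space, TOKEN, then a mandatory EOL.
RULE = re.compile(r"[ \t]*(\w+)[ \t]+(\w+)[ \t]*(?:\r\n|\r|\n)")
# Whitespace and extra EOLs between rules are skipped, like %ignore.
SKIP = re.compile(r"[ \t\r\n]+")


def parse_rules(text):
    """Return (TOKEN, TOKEN) pairs; raise ValueError on malformed input."""
    pos, pairs = 0, []
    while pos < len(text):
        match = RULE.match(text, pos)
        if match:
            pairs.append((match.group(1), match.group(2)))
            pos = match.end()
            continue
        match = SKIP.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected input at offset {pos}")
        pos = match.end()
    return pairs
```

With this sketch, blank lines and padding around the tokens are accepted, while a line with no tab or space between the tokens, or a final line with no line ending, raises `ValueError`.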

erezsh commented, Jan 4, 2019

There are many ways to solve this. You should think about what sort of errors you want to catch. If someone adds an extra space, what happens? If someone forgets a number, what happens? Etc.

You can be very strict and do

FIELD134: " "~12
        | " "~11 DIGIT
        | " "~10 DIGIT~2
        ....
        | DIGIT~12

Or you can make it more relaxed. It’s up to you and the interface you want to provide.
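
To make the trade-off concrete, here is a minimal sketch of the two extremes as plain Python checks. It assumes the elided alternatives continue the same pattern (leading spaces followed by digits, 12 characters in total); the relaxed rule is just one possible choice, and the function names are hypothetical.

```python
import re

FIELD_WIDTH = 12  # from the " "~12 ... DIGIT~12 alternatives


def is_strict_field(text):
    """Strict: exactly FIELD_WIDTH characters, spaces first, then digits.

    Assumes the elided alternatives follow the same spaces-then-digits shape.
    """
    return len(text) == FIELD_WIDTH and re.fullmatch(r" *[0-9]*", text) is not None


def is_relaxed_field(text):
    """Relaxed (one possible choice): any padding, at most FIELD_WIDTH digits."""
    core = text.strip(" ")
    return len(core) <= FIELD_WIDTH and re.fullmatch(r"[0-9]*", core) is not None
```

The strict check rejects `"134"` because the padding is missing, which catches misaligned columns; the relaxed check accepts `" 134 "` and leaves alignment to the caller.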
