Possible bug in 'standard' lexer: the longest token is matched incorrectly
When I use the standard lexer (the default for the lalr parser), it sometimes determines the longest match incorrectly.
Consider the following code:
import lark

grammar = r'''
start: BINOP | ASSIGNMENT_OP
ASSIGNMENT_OP: "="
    | "+="
BINOP: "*"
    | "/"
    | "%"
    | "+"
    | "-"
    | "long_operator"
%import common.WS
%ignore WS
'''

parser = lark.Lark(grammar, parser='lalr')
code = '+='
print(list(parser.lex(code)))
It should print [Token(ASSIGNMENT_OP, '+=')] because the longest match is +=, and the documentation says that in this case the longest match should be returned. However, the code prints [Token(BINOP, '+'), Token(ASSIGNMENT_OP, '=')], which is incorrect.
I suppose the lexer might be comparing not the lengths of the terminals that actually match (+ and += in this case) but the lengths of the longest possible options (long_operator and +=).
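To make this guess concrete, here is a minimal sketch in plain Python. It is a simplified model and not lark's actual implementation: the names TERMINALS, build_scanner and tokenize are made up for illustration, and each terminal is assumed to be a list of literal alternatives. The terminals are ordered by the longest string they could match and joined into one alternation regex, and Python's re takes the first alternative that matches rather than the longest one.

```python
import re

# Hypothetical model: each terminal is a list of literal alternatives.
TERMINALS = {
    "ASSIGNMENT_OP": ["=", "+="],
    "BINOP": ["*", "/", "%", "+", "-", "long_operator"],
}

def build_scanner(terminals):
    # Assumption: terminals are ordered by the longest string each one
    # *could* match, not by what it matches at the current position.
    ordered = sorted(terminals.items(), key=lambda kv: -max(map(len, kv[1])))
    parts = [
        "(?P<%s>%s)" % (name, "|".join(re.escape(alt) for alt in alts))
        for name, alts in ordered
    ]
    # re alternation is ordered: the first alternative that matches wins.
    return re.compile("|".join(parts))

def tokenize(text, scanner):
    pos, tokens = 0, []
    while pos < len(text):
        m = scanner.match(text, pos)
        if m is None:
            raise ValueError("no terminal matches at position %d" % pos)
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

tokens = tokenize("+=", build_scanner(TERMINALS))
print(tokens)  # [('BINOP', '+'), ('ASSIGNMENT_OP', '=')]
```

In this model BINOP is tried first only because of long_operator, so + is consumed before += gets a chance, which is the same split as in the output reported above.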
P.S. The lark version is 0.7.0, installed as the python-lark-parser package on ArchLinux.
Issue Analytics
- State:
- Created 4 years ago
- Comments:11 (4 by maintainers)
Top GitHub Comments
OK, now I see I was wrong. Thank you for your explanation
Incorrect. a | b | c is a regexp. Whenever you define a terminal that is not a plain string, it’s a regexp. If you want to group different terminals together, that’s what rules are for.
If you think you can provide a better explanation, feel free to write one, and I might include it in the docs.
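The advice about rules can be modelled the same way. The sketch below is again plain Python and only an assumption about the idea, not lark's code: every operator literal is treated as its own terminal, longer literals are tried first, and a hypothetical RULES mapping stands in for grammar rules that group the terminals afterwards.

```python
import re

# Hypothetical stand-in for grammar rules: each rule groups several
# single-literal terminals.
RULES = {
    "assignment_op": ["=", "+="],
    "binop": ["*", "/", "%", "+", "-", "long_operator"],
}

# Every literal is its own terminal; longer literals are tried first,
# so "+=" comes before "+".
literals = sorted(
    ((lit, rule) for rule, lits in RULES.items() for lit in lits),
    key=lambda pair: -len(pair[0]),
)
scanner = re.compile("|".join("(%s)" % re.escape(lit) for lit, _ in literals))

def tokenize(text):
    pos, out = 0, []
    while pos < len(text):
        m = scanner.match(text, pos)
        if m is None:
            raise ValueError("no terminal matches at position %d" % pos)
        # m.lastindex is the 1-based index of the literal that matched.
        out.append((literals[m.lastindex - 1][1], m.group()))
        pos = m.end()
    return out

result = tokenize("+=")
print(result)  # [('assignment_op', '+=')]
```

In a real grammar the equivalent change would presumably be to write the two groups as lowercase rules (for example binop and assignment_op) instead of uppercase terminals, so that each quoted string is lexed as a separate terminal.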