Contextual lexer matching terminal that can't be reached in context
I think this is a bug, but it's possible I'm misunderstanding how the contextual lexer works.
The grammar I have has tokens that depend on the context. Until now it seemed to work fine using lalr with the contextual lexer, but after I added the return_stmt it seems to have started making wrong choices about how to lex the code.
In this case, where the grammar should parse a NAME_CONT, it's parsing a DEC_NUMBER.
Let me give some more info:
The grammar is trying to match the code:
*** Functions ***
My Function 1
While var 2
var 3
and it's matching the 1 from My Function 1 as a DEC_NUMBER instead of a NAME_CONT, even though a DEC_NUMBER isn't valid in that context and a NAME_CONT is. The identifier rule is: identifier: NAME (NAME|NAME_CONT|WS)*. It matched NAME for My and NAME_CONT for Function, yet it then failed to match the 1 as NAME_CONT and ended up matching it as a DEC_NUMBER, which isn't valid in that context.
The place where the rules should be matching is at the func_name, i.e.:
file_input: (_NEWLINE | root_stmt)*
?root_stmt: func_block
func_block: func_header WS? _NEWLINE (func_stmt)*
func_header: BLOCK_START "functions"i BLOCK_END
func_stmt: func_name func_suite
func_name: identifier
?identifier: NAME (NAME|NAME_CONT|WS)*
NAME: /[^\d\W]\w*/
NAME_CONT: /\w+/
WS: /\s+/
That alone works, but it fails in the actual grammar:
test_lark_contextual_lexer_issue.py.txt
This happened right after I added the return_stmt! In that same grammar, just removing the return_stmt seems to make the lexer do the right thing. I'm not sure why, though.
The docs say that:
The contextual lexer communicates with the parser, and uses the parser’s lookahead prediction to narrow its choice of tokens. So at each point, the lexer only matches the subgroup of terminals that are legal at that parser state, instead of all of the terminals.
Given that, I think it's a bug: the lexer is not considering the context as it should in this case. Otherwise, I'm probably misunderstanding how it decides what to lex based on the context, or the grammar can reach a DEC_NUMBER through some means that I can't see. As far as I can see, the func_suite needs a _NEWLINE to start, so it should not be possible to get to a DEC_NUMBER from there.
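To make my expectation concrete, here is a minimal sketch of how I understand that passage of the docs. It is not Lark's implementation, just plain Python using the regexes from the grammar above. The DEC_NUMBER pattern, the fixed ordering of the terminals, and the next_token helper are all assumptions made for this illustration.

```python
import re

# Terminal patterns. NAME, NAME_CONT and WS are copied from the grammar above;
# the DEC_NUMBER pattern is an assumption made only for this illustration.
# The list order stands in for terminal priority (a simplification).
TERMINALS = [
    ("DEC_NUMBER", re.compile(r"\d+")),
    ("NAME", re.compile(r"[^\d\W]\w*")),
    ("NAME_CONT", re.compile(r"\w+")),
    ("WS", re.compile(r"\s+")),
]


def next_token(text, pos, accepted):
    """Return (terminal, matched_text) for the first accepted terminal matching at pos."""
    for name, pattern in TERMINALS:
        if name not in accepted:
            continue  # the parser state does not allow this terminal here
        match = pattern.match(text, pos)
        if match:
            return name, match.group()
    raise ValueError(f"no accepted terminal matches at position {pos}")


line = "My Function 1"
pos = line.index("1")

# What I expect inside `identifier`: only these terminals are candidates.
in_identifier = {"NAME", "NAME_CONT", "WS"}
contextual = next_token(line, pos, in_identifier)

# A lexer that ignores the parser state considers every terminal.
everything = {name for name, _ in TERMINALS}
basic = next_token(line, pos, everything)

print(contextual)
print(basic)
```

With only the identifier terminals allowed, the 1 can only come out as NAME_CONT. It comes out as DEC_NUMBER only when every terminal is considered, which is the behaviour I'm seeing and did not expect from the contextual lexer.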
Issue Analytics
- Created 3 years ago
- Comments:18 (18 by maintainers)
Top GitHub Comments
It's actually parsing as a DEC_NUMBER. The full exception trace is:
That may be possible, but it is an extraneous complication just to shift the error from one place to another, and one that would have real performance implications. So I don't plan to do it. But I appreciate the suggestion.