Strange lexer behaviour
Describe the bug
I have found an inconsistent behaviour in the lexer in a quite particular case. I'm writing a grammar for a pseudo query language that mixes words/tokens with AND, OR, quotes, and parentheses. The main idea is that the AND and OR operators apply to just one token/quoted string, unless they appear inside a parenthesised expression, in which case the operands expand to all the tokens to the left or to the right. For example, in A B AND C the operands of AND would be "B" and "C", while in (A B AND C) the operands of AND would be "A B" and "C".
I've defined WORD as /\S*[^)]/ to avoid collisions with the closing parenthesis in case there is no space to delimit the token properly. Depending on whether I'm inside a parenthesis or not and, if inside, whether there are one or more tokens, the regular expression is evaluated differently.
To Reproduce
The grammar and behaviour:
```python
from lark import Lark

parser = Lark(r"""
    start: expr+
    expr: expr OP_OR and_expr
        | and_expr
    and_expr: and_expr OP_AND operand
        | operand
    operand: WORD
        | QUOTED_EXPR
        | par_expr
    par_expr: PAR_OPEN broaden_expr PAR_CLOSE
    broaden_expr: broaden_expr OP_OR broaden_and_expr
        | broaden_and_expr
    broaden_and_expr: broaden_and_expr OP_AND broaden_operand
        | broaden_operand
    broaden_operand: WORD+
        | QUOTED_EXPR
        | par_expr

    OP_OR.10: "OR"
    OP_AND.10: "AND"
    OP_QUOTES.10: "\""
    PAR_OPEN.10: "("
    PAR_CLOSE.10: ")"
    QUOTED_EXPR.10: /\".*?\"/  // lazy match, so it stops at the first closing quote
    WORD: /\S*[^)]/

    %import common.WS  // imports from the terminal library
    %ignore WS
""", parser="lalr")
```
Then, with the following inputs (a small driver for these cases is sketched below):

- query='(operand AND whatever)' :: OK
- query='(operand AND whatever dfa)' :: OK
- query='(operand AND whatever dfa )' :: OK
- query='(operand) AND (whatever)' :: NOK => tokenizes "operand)" and "whatever)" as the WORD tokens
- query='(operand ) AND (whatever )' :: OK => distinguishes the ")" token
- query='(operand) AND (whatever dfa)' :: NOK => tokenizes "operand)" as a single token
- query='(operand ) AND (whatever dfa)' :: OK => "dfa" and ")" are parsed as different tokens
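A minimal driver for reproducing this, as a sketch: it assumes the `parser` object built above, and the exact tree shapes and exception types may vary between lark versions.

```python
# Run the queries above through the parser; NOK cases either fail to parse
# or show ')' swallowed into a WORD token in the resulting tree.
queries = [
    '(operand AND whatever)',        # OK
    '(operand) AND (whatever)',      # NOK: WORD may swallow the ')'
    '(operand ) AND (whatever )',    # OK: the space isolates the ')'
    '(operand) AND (whatever dfa)',  # NOK
]

for query in queries:
    print(query, '=>')
    try:
        print(parser.parse(query).pretty())
    except Exception as exc:  # e.g. lark's UnexpectedInput subclasses
        print('  parse error:', type(exc).__name__)
```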
It seems that the + applied to WORD in broaden_operand affects the way the regex defining WORD is evaluated.
BTW, for my purposes I've solved it by changing the regex directly to /[^)\s]/+, but I thought the behaviour should be reported, to check whether this is how the regex evaluation is supposed to work.
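For what it's worth, the difference between the two patterns is visible with plain `re`, outside lark entirely. A minimal illustration, using the plain-regex equivalent `[^)\s]+` of the lark item `/[^)\s]/+` (this shows only the regex side, not lark's contextual-lexer behaviour):

```python
import re

# /\S*[^)]/ constrains only the *last* character: \S* can absorb a ')',
# and [^)] also matches whitespace, so the match can run past the parenthesis.
print(re.match(r'\S*[^)]', 'operand) AND').group())   # 'operand) '

# [^)\s]+ forbids ')' (and whitespace) at every position, so it stops cleanly.
print(re.match(r'[^)\s]+', 'operand) AND').group())   # 'operand'
```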
Top GitHub Comments
Sure, I think (especially for people who have previous experience with other parser tools) that this would help in specifying where the %ignore directive applies. Bear in mind that in lark there is not so much separation between the lexer and the grammar parser, since rules and terminal definitions are not as isolated from one another as in other parsing tools.
Thank you!
%ignore
All occurrences of the terminal will be ignored and won't be part of the parse. Note that %ignore discards complete tokens during parsing; it does not affect the way your regexps are matched (e.g., ignoring WS, short for whitespace, would not prevent the lexer from considering whitespace when tokenizing the input).
Using the %ignore directive results in a cleaner grammar.
It's especially important for the LALR(1) algorithm, because adding whitespace (or comments, or other extraneous elements) explicitly in the grammar harms its predictive abilities, which are based on a lookahead of 1.
Syntax:
%ignore <TERMINAL>
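To make the "ignored at the grammar level, still tokenized" point concrete, here is a minimal toy sketch (a hypothetical grammar, not the one from the issue):

```python
from lark import Lark

# WS is ignored, yet it still delimits WORD tokens during lexing.
toy = Lark(r"""
    start: WORD+
    WORD: /\w+/
    %import common.WS
    %ignore WS
""", parser="lalr")

print(toy.parse("foo bar baz").children)  # three WORD tokens; no WS tokens reach the tree
```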
Yeah, I see… I was thinking about the actual meaning/implication of "ignoring the whitespaces" at tokenizing time, which in this case would limit expressivity. I was implicitly assuming that loss of expressivity because I thought %ignore WS worked at the character level, but now the behaviour of the directive in lark is clearer to me: it ignores at the grammar level.
Thank you @MegaIng!