
Strange lexer behaviour


Describe the bug

I have found an inconsistent behaviour in the lexer in a quite particular case. I'm writing a grammar for a pseudo query language mixing words/tokens with AND, OR, quotes, and parentheses. The main idea is that the AND and OR operators apply to just one token/quoted string, unless they are within a parenthesized expression, in which case the operands expand to all the tokens to the left or right. For example, in `A B AND C` the operands of AND would be "B" and "C", while in `(A B AND C)` the operands of AND would be "A B" and "C".

I've defined WORD as /\S*[^)]/ to avoid collisions with the closing parenthesis in case there is no space to delimit the token properly. Depending on whether I'm inside a parenthesized expression and, if so, whether there are one or more tokens, the regular expression is evaluated differently.
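To see why this terminal is delicate, it helps to run the regex in plain Python `re` (the engine Lark's lexer builds on). This is a minimal illustration, not taken from the issue: note that `[^)]` matches any character except `)`, including whitespace, so the terminal can consume a closing parenthesis via `\S*` and then take the following space.

```python
import re

# The WORD terminal exactly as defined in the grammar below.
WORD = re.compile(r"\S*[^)]")

# [^)] matches any character except ')', including whitespace, so the
# greedy \S* can swallow the ')' and [^)] then takes the space after it:
print(WORD.match("operand) AND").group())   # -> 'operand) '  (paren captured)

# With a space before ')', \S* stops at the space and [^)] consumes it,
# leaving ')' to be matched as its own token:
print(WORD.match("operand ) AND").group())  # -> 'operand '

# At end of input there is no character left for [^)], so backtracking
# gives up the ')':
print(WORD.match("operand)").group())       # -> 'operand'
```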

To Reproduce

The grammar and behaviour:

```python
from lark import Lark

parser = Lark(r"""

start: expr+

expr: expr OP_OR and_expr
    | and_expr

and_expr: and_expr OP_AND operand
    | operand

operand: WORD
    | QUOTED_EXPR
    | par_expr

par_expr: PAR_OPEN broaden_expr PAR_CLOSE

broaden_expr: broaden_expr OP_OR broaden_and_expr
    | broaden_and_expr

broaden_and_expr: broaden_and_expr OP_AND broaden_operand
    | broaden_operand

broaden_operand: WORD+
    | QUOTED_EXPR
    | par_expr

OP_OR.10: "OR"
OP_AND.10: "AND"
OP_QUOTES.10: "\""
PAR_OPEN.10: "("
PAR_CLOSE.10: ")"
QUOTED_EXPR.10: /\".*?\"/                           // non-greedy, so it stops at the first closing quote
WORD: /\S*[^)]/

%import common.WS                                   // imports from terminal library
%ignore WS
""", parser="lalr")
```

Then, with the following inputs (a small reproduction driver follows the list):

  • query='(operand AND whatever)' :: OK
  • query='(operand AND whatever dfa)' :: OK
  • query='(operand AND whatever dfa )' :: OK
  • query='(operand) AND (whatever)' :: NOK => parses "operand)" and "whatever)" as the tokens
  • query='(operand ) AND (whatever )' :: OK => distinguishes the ")" token
  • query='(operand) AND (whatever dfa)' :: NOK => parses "operand)" as a token
  • query='(operand ) AND (whatever dfa)' :: OK => "dfa" and ")" are parsed as different tokens
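For completeness, here is a sketch of a driver (assuming the `parser` object built above) that runs these queries and dumps the tokens the parser actually saw:

```python
# Hypothetical reproduction driver; `parser` is the Lark instance from above.
queries = [
    "(operand AND whatever)",
    "(operand AND whatever dfa)",
    "(operand) AND (whatever)",     # NOK case: 'operand) ' lexed as one WORD
    "(operand ) AND (whatever )",   # OK case: ')' becomes its own token
]

for q in queries:
    try:
        tree = parser.parse(q)
        tokens = [repr(t) for t in tree.scan_values(lambda v: True)]
        print(q, "->", tokens)
    except Exception as e:  # e.g. lark.exceptions.UnexpectedInput
        print(q, "-> parse error:", e)
```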

It seems that the + applied to WORD in broaden_operand affects the way the regex used for defining words is evaluated.

BTW, for my purposes I've solved it by changing the regex directly to /[^)\s]/+, but I thought the behaviour should be reported to check whether this is how the regex evaluation is supposed to work.
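For what it's worth, the workaround works because excluding both `)` and whitespace from the character class means the token can never extend past either one. A quick check in plain `re`, under the same assumptions as the earlier sketch:

```python
import re

# The workaround terminal; in Lark, /[^)\s]/+ compiles to effectively the
# same pattern as /[^)\s]+/ - one or more characters that are neither ')'
# nor whitespace, so the token always stops before both.
FIXED_WORD = re.compile(r"[^)\s]+")

print(FIXED_WORD.match("operand) AND").group())   # -> 'operand'
print(FIXED_WORD.match("operand ) AND").group())  # -> 'operand'
```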

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 20 (10 by maintainers)

Top GitHub Comments

1 reaction
cbobed commented, Sep 20, 2021

Sure, I think (especially for people who have previous experience with other parser tools) that this would help to specify where the %ignore directive applies. Bear in mind that in lark there is not as much separation between the lexer and the grammar parser as in other tools, since rules and terminal definitions are not as isolated from one another.

Thank you!


%ignore

All occurrences of the terminal will be ignored, and won't be part of the parse. Note that %ignore affects complete tokens during parsing; it does not affect the way your regexps are matched (e.g., ignoring WS - short for whitespace - does not prevent the lexer from considering whitespace when tokenizing the input).

Using the %ignore directive results in a cleaner grammar.

It's especially important for the LALR(1) algorithm, because adding whitespace (or comments, or other extraneous elements) explicitly to the grammar harms its predictive abilities, which are based on a lookahead of 1.

Syntax:

%ignore <TERMINAL>
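Here is a minimal sketch (not from the issue) of the token-level behaviour described above: WS tokens are matched and then discarded, but a terminal whose own regex can consume whitespace, like the WORD from this issue, still does so.

```python
from lark import Lark

demo = Lark(r"""
start: WORD+
WORD: /\S*[^)]/
%import common.WS
%ignore WS
""", parser="lalr")

# %ignore drops whole WS tokens after lexing; it does not delete the
# characters beforehand, nor stop WORD's regex from eating a space itself:
print([t.value for t in demo.parse("foo) bar").children])
# -> ['foo) ', 'bar']   (the space ended up inside the first WORD)
```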


0 reactions
cbobed commented, Sep 20, 2021

Yeah, I see… I was thinking about the actual meaning/implication of "ignoring the whitespaces" at tokenizing time, which in this case would limit expressivity (I had implicitly assumed this loss of expressivity would occur if %ignore WS worked at the character level, but now the behaviour of the directive in lark is clearer to me - it ignores at the grammar/token level).

Thank you @MegaIng!


