Strange lexer behaviour
Describe the bug
I have found an inconsistent behaviour in the lexer in a quite particular case. I'm writing a grammar for a pseudo query language that mixes words/tokens with AND, OR, quotes, and parentheses. The main idea is that the AND and OR operators apply to just one token/quoted string, unless they appear inside a parenthesised expression, in which case the operands expand to all the tokens to the left or to the right. For example, in A B AND C the operands of AND would be "B" and "C", while in (A B AND C) the operands of AND would be "A B" and "C".
I've defined WORD as /\S*[^)]/ to avoid collisions with the closing parenthesis in case there is no space to delimit the token properly. Depending on whether I'm inside a parenthesis or not and, if inside, whether there are one or more tokens, the regular expression is evaluated differently.
To Reproduce
The grammar and behaviour:
```python
from lark import Lark

parser = Lark(r"""
    start: expr+
    expr: expr OP_OR and_expr
        | and_expr
    and_expr: and_expr OP_AND operand
        | operand
    operand: WORD
        | QUOTED_EXPR
        | par_expr
    par_expr: PAR_OPEN broaden_expr PAR_CLOSE
    broaden_expr: broaden_expr OP_OR broaden_and_expr
        | broaden_and_expr
    broaden_and_expr: broaden_and_expr OP_AND broaden_operand
        | broaden_operand
    broaden_operand: WORD+
        | QUOTED_EXPR
        | par_expr

    OP_OR.10: "OR"
    OP_AND.10: "AND"
    OP_QUOTES.10: "\""
    PAR_OPEN.10: "("
    PAR_CLOSE.10: ")"
    QUOTED_EXPR.10: /\".*?\"/  // lazy match, so it stops at the first closing quote
    WORD: /\S*[^)]/

    %import common.WS  // imports from the terminal library
    %ignore WS
""", parser="lalr")
```
Then, with the following inputs (a small driver for these cases is sketched below):

- query='(operand AND whatever)' :: OK
- query='(operand AND whatever dfa)' :: OK
- query='(operand AND whatever dfa )' :: OK
- query='(operand) AND (whatever)' :: NOK => tokenizes "operand)" and "whatever)" as the WORD tokens
- query='(operand ) AND (whatever )' :: OK => distinguishes the ")" token
- query='(operand) AND (whatever dfa)' :: NOK => tokenizes "operand)" as a single token
- query='(operand ) AND (whatever dfa)' :: OK => "dfa" and ")" are parsed as different tokens
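A minimal driver for reproducing this, as a sketch: it assumes the `parser` object built above, and the exact tree shapes and exception types may vary between lark versions.

```python
# Run the queries above through the parser; NOK cases either fail to parse
# or show ')' swallowed into a WORD token in the resulting tree.
queries = [
    '(operand AND whatever)',        # OK
    '(operand) AND (whatever)',      # NOK: WORD may swallow the ')'
    '(operand ) AND (whatever )',    # OK: the space isolates the ')'
    '(operand) AND (whatever dfa)',  # NOK
]

for query in queries:
    print(query, '=>')
    try:
        print(parser.parse(query).pretty())
    except Exception as exc:  # e.g. lark's UnexpectedInput subclasses
        print('  parse error:', type(exc).__name__)
```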
It seems that the + applied to WORD in broaden_operand affects the way the regex defining WORD is evaluated.
BTW, for my purposes I've solved it by changing the regex directly to /[^)\s]/+, but I thought the behaviour should be reported, to check whether this is how the regex evaluation is supposed to work.
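For what it's worth, the difference between the two patterns is visible with plain `re`, outside lark entirely. A minimal illustration, using the plain-regex equivalent `[^)\s]+` of the lark item `/[^)\s]/+` (this shows only the regex side, not lark's contextual-lexer behaviour):

```python
import re

# /\S*[^)]/ constrains only the *last* character: \S* can absorb a ')',
# and [^)] also matches whitespace, so the match can run past the parenthesis.
print(re.match(r'\S*[^)]', 'operand) AND').group())   # 'operand) '

# [^)\s]+ forbids ')' (and whitespace) at every position, so it stops cleanly.
print(re.match(r'[^)\s]+', 'operand) AND').group())   # 'operand'
```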
Top GitHub Comments
Sure, I think (especially for people who have previous experience with other parser tools) that this would help in specifying where the %ignore directive applies. Bear in mind that in lark there is not so much separation between the lexer and the grammar parser, since rules and terminal definitions are not as isolated from one another as in other parsing tools.
Thank you!
%ignore
All occurrences of the terminal will be ignored and won't be part of the parse. Note that %ignore discards complete tokens during parsing; it does not affect the way your regexps are matched (e.g., ignoring WS, short for whitespace, would not prevent the lexer from considering whitespace when tokenizing the input).
Using the %ignore directive results in a cleaner grammar.
It's especially important for the LALR(1) algorithm, because adding whitespace (or comments, or other extraneous elements) explicitly in the grammar harms its predictive abilities, which are based on a lookahead of 1.
Syntax:
%ignore <TERMINAL>
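To make the "ignored at the grammar level, still tokenized" point concrete, here is a minimal toy sketch (a hypothetical grammar, not the one from the issue):

```python
from lark import Lark

# WS is ignored, yet it still delimits WORD tokens during lexing.
toy = Lark(r"""
    start: WORD+
    WORD: /\w+/
    %import common.WS
    %ignore WS
""", parser="lalr")

print(toy.parse("foo bar baz").children)  # three WORD tokens; no WS tokens reach the tree
```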
Yeah, I see… I was thinking about the actual meaning/implication of "ignoring the whitespaces" at tokenizing time, which in this case would limit expressivity. I was implicitly assuming that loss of expressivity because I thought %ignore WS worked at the character level, but now the behaviour of the directive in lark is clearer to me: it ignores at the grammar level.
Thank you @MegaIng!