Contextual lexer matching terminal that can't be reached in context

I think this is a bug, but it may be possible I’m misunderstanding how the contextual lexer works.

My grammar has tokens that depend on the context. Until now it seemed to work fine using LALR with the contextual lexer, but after I added the return_stmt rule the lexer started making wrong choices.

In this case, where the grammar should parse a NAME_CONT, it’s parsing a DEC_NUMBER.

Let me give some more info:

The grammar is trying to match the code:

*** Functions ***
My Function 1
    While var 2
        var 3

and it’s matching the 1 from My Function 1 as a DEC_NUMBER instead of a NAME_CONT, even though a DEC_NUMBER isn’t valid in that context and a NAME_CONT is. The identifier rule is identifier: NAME (NAME|NAME_CONT|WS)*. It matched NAME for My and NAME_CONT for Function, yet it then failed to match the 1 as NAME_CONT and matched it as a DEC_NUMBER instead.

The place where the rules should be matching is at the func_name – i.e.:

file_input: (_NEWLINE | root_stmt)*
?root_stmt: func_block
func_block: func_header WS? _NEWLINE (func_stmt)*
func_header: BLOCK_START "functions"i BLOCK_END
func_stmt: func_name func_suite
func_name: identifier
?identifier: NAME (NAME|NAME_CONT|WS)*
NAME: /[^\d\W]\w*/
NAME_CONT: /\w+/
WS: /\s+/

That alone works, but it fails in the actual grammar:

test_lark_contextual_lexer_issue.py.txt

This started right after I added the return_stmt.

Removing the return_stmt from that same grammar makes the lexer do the right thing again, though I’m not sure why.
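The overlap between the terminals can be checked outside of Lark with Python's re module. The sketch below uses the NAME, NAME_CONT and WS patterns from the excerpt above. The DEC_NUMBER pattern is an assumption (the full grammar is only attached, not shown), so treat the result as an illustration of the ambiguity rather than a diagnosis.

```python
import re

# NAME, NAME_CONT and WS are copied from the grammar excerpt.
# DEC_NUMBER is NOT shown in the excerpt; this pattern is an assumption.
TERMINALS = {
    "NAME": r"[^\d\W]\w*",
    "NAME_CONT": r"\w+",
    "WS": r"\s+",
    "DEC_NUMBER": r"0|[1-9]\d*",  # assumed pattern
}


def matching_terminals(text):
    """Return the names of all terminals whose pattern matches the whole text."""
    return [
        name
        for name, pattern in TERMINALS.items()
        if re.fullmatch(pattern, text)
    ]


for word in ["My", "Function", "1"]:
    print(word, "->", matching_terminals(word))
```

Under these assumptions the text 1 is matched by both NAME_CONT and DEC_NUMBER, so only the parser context can decide between them.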

The docs say that:

> The contextual lexer communicates with the parser, and uses the parser’s lookahead prediction to narrow its choice of tokens. So at each point, the lexer only matches the subgroup of terminals that are legal at that parser state, instead of all of the terminals.

Given that, I think this is a bug: the lexer is not considering the context as it should in this case. Otherwise, either I’m misunderstanding how it decides what to lex based on the context, or the grammar can reach a DEC_NUMBER through some path that I can’t see. As far as I can tell, the func_suite needs a _NEWLINE to start, so it should not be possible to get to a DEC_NUMBER from there.
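The behaviour the docs describe can be sketched as a toy lexer that is handed the set of terminals the parser accepts in its current state. This is not Lark's implementation: the next_token helper, the try-order of the patterns and the DEC_NUMBER pattern are all assumptions made for illustration.

```python
import re

# NAME, NAME_CONT and WS come from the grammar in the issue.
# The DEC_NUMBER pattern and the order in which terminals are tried
# are assumptions for this sketch.
PATTERNS = [
    ("DEC_NUMBER", re.compile(r"0|[1-9]\d*")),  # assumed pattern
    ("NAME", re.compile(r"[^\d\W]\w*")),
    ("NAME_CONT", re.compile(r"\w+")),
    ("WS", re.compile(r"\s+")),
]


def next_token(text, pos, accepts):
    """Match at `pos`, trying only the terminals listed in `accepts`."""
    for name, regex in PATTERNS:
        if name not in accepts:
            continue  # not legal in this parser state, so never tried
        match = regex.match(text, pos)
        if match:
            return name, match.group()
    raise ValueError(f"no terminal in {sorted(accepts)} matches at {pos}")


text = "My Function 1"
pos = len(text) - 1  # position of the trailing "1"

# State that accepts only the identifier terminals.
inside_identifier = {"NAME", "NAME_CONT", "WS"}
print(next_token(text, pos, inside_identifier))

# Hypothetical state that also accepts DEC_NUMBER.
print(next_token(text, pos, inside_identifier | {"DEC_NUMBER"}))
```

With the first accept set the 1 can only come out as NAME_CONT. It comes out as DEC_NUMBER only if the state's accept set contains DEC_NUMBER, which is the situation the issue argues should not be reachable at func_name.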

Issue Analytics

Created 3 years ago
Comments: 18 (18 by maintainers)

Top GitHub Comments

fabioz commented, Nov 2, 2020

>> In this case, where the grammar should parse a NAME_CONT it’s parsing a DEC_NUMBER

> Is it actually parsing it as DEC_NUMBER, or only saying so in the UnexpectedToken error?

It’s actually parsing as a DEC_NUMBER.

The full exception trace is:

Traceback (most recent call last):
  File "X:\lark\lark\parsers\lalr_parser.py", line 86, in feed_token
    action, arg = states[state][token.type]
KeyError: 'DEC_NUMBER'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "X:\vscode-robot\rflang\tests\rflang_tests\test_lark_contextual_lexer_issue.py", line 97, in <module>
    parse(
  File "X:\vscode-robot\rflang\tests\rflang_tests\test_lark_contextual_lexer_issue.py", line 87, in parse
    tree = lark_spec.parse(source_code)
  File "X:\lark\lark\lark.py", line 493, in parse
    return self.parser.parse(text, start=start)
  File "X:\lark\lark\parser_frontends.py", line 138, in parse
    return self._parse(start, self.make_lexer(text))
  File "X:\lark\lark\parser_frontends.py", line 73, in _parse
    return self.parser.parse(input, start, *args)
  File "X:\lark\lark\parsers\lalr_parser.py", line 35, in parse
    return self.parser.parse(*args)
  File "X:\lark\lark\parsers\lalr_parser.py", line 129, in parse
    return self.parse_from_state(parser_state)
  File "X:\lark\lark\parsers\lalr_parser.py", line 145, in parse_from_state
    raise e
  File "X:\lark\lark\parsers\lalr_parser.py", line 136, in parse_from_state
    state.feed_token(token)
  File "X:\lark\lark\parsers\lalr_parser.py", line 89, in feed_token
    raise UnexpectedToken(token, expected, state=state, puppet=None)
lark.exceptions.UnexpectedToken: Unexpected token Token('DEC_NUMBER', '1') at line 3, column 13.
Expected one of: 
	* WS
	* _NEWLINE
	* NAME
	* NAME_CONT
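The first frame of the trace shows the parser looking the token type up in its action table (states[state][token.type]) and turning the KeyError into an UnexpectedToken. A simplified sketch of that lookup follows. The one-state table, the action values and the UnexpectedToken stand-in class are hypothetical; only the lookup-then-raise shape is taken from the trace.

```python
class UnexpectedToken(Exception):
    """Simplified stand-in for lark.exceptions.UnexpectedToken."""

    def __init__(self, token_type, expected):
        super().__init__(
            f"Unexpected token {token_type!r}. Expected one of: {expected}"
        )
        self.token_type = token_type
        self.expected = expected


# Hypothetical action table with a single state whose legal terminals are
# the four listed in the error message above. The actions are made up.
states = {
    0: {
        "WS": ("shift", 1),
        "_NEWLINE": ("shift", 2),
        "NAME": ("shift", 1),
        "NAME_CONT": ("shift", 1),
    },
}


def feed_token(state, token_type):
    """Look up the action for a token type, mirroring the shape in the trace."""
    try:
        return states[state][token_type]
    except KeyError:
        # The token type has no entry for this state: report what was legal.
        expected = sorted(states[state])
        raise UnexpectedToken(token_type, expected)


print(feed_token(0, "NAME_CONT"))
try:
    feed_token(0, "DEC_NUMBER")
except UnexpectedToken as error:
    print(error)
```

In this sketch the error can only be raised for a token that already carries the type DEC_NUMBER when it reaches the parser, which is consistent with the statement above that the 1 is actually being lexed as a DEC_NUMBER.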

erezsh commented, Nov 3, 2020

> I think it should try to tokenize the next words using the remainder of the identifier construct (NAME_CONT|WS) and then the tokens based on rules that can happen at that context

That may be possible, but it is an extraneous complication with real performance implications, just to shift the error from one place to another. So I don’t plan to do it, but I appreciate the suggestion.
