Possible bug in 'standard' lexer: the longest token is matched incorrectly

When I use the standard lexer (the default for the lalr parser), it sometimes determines the longest match incorrectly.

Consider the following code:

import lark

grammar = r'''
start: BINOP | ASSIGNMENT_OP

ASSIGNMENT_OP: "="
    | "+="

BINOP: "*"
    | "/"
    | "%" 
    | "+"
    | "-"
    | "long_operator"

%import common.WS
%ignore WS
'''

parser = lark.Lark(grammar, parser='lalr')
code = '+='
print(list(parser.lex(code)))

It should print [Token(ASSIGNMENT_OP, '+=')] because the longest match is += and the documentation says that in this case the longest match should be returned. However, the code prints [Token(BINOP, '+'), Token(ASSIGNMENT_OP, '=')], which is incorrect.

I suspect the lexer compares not the lengths of the text each terminal actually matches (+ and += in this case) but the lengths of each terminal's longest possible alternative (long_operator and +=).
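The behaviour described above can be reproduced without lark in a simplified model. The sketch below is not lark's actual implementation: it assumes a lexer that orders terminals by the length of their longest alternative and then tries them as one combined regexp, where the first alternative that matches wins. The names TERMINALS, build_scanner and lex are hypothetical, and whitespace handling is left out.

```python
import re

# Hypothetical, simplified model of a regexp-based lexer (not lark's code).
# Each terminal is a list of literal alternatives, as in the grammar above.
TERMINALS = {
    "ASSIGNMENT_OP": ["=", "+="],
    "BINOP": ["*", "/", "%", "+", "-", "long_operator"],
}


def build_scanner(terminals):
    # Assumption: terminals are ordered by the length of their LONGEST
    # alternative, so BINOP ("long_operator") comes before ASSIGNMENT_OP.
    ordered = sorted(terminals.items(),
                     key=lambda item: -max(len(alt) for alt in item[1]))
    groups = []
    for name, alternatives in ordered:
        body = "|".join(re.escape(alt) for alt in alternatives)
        groups.append(f"(?P<{name}>{body})")
    # Python's regexp alternation is ordered: the first branch that matches
    # is taken, even if a later branch would match more text.
    return re.compile("|".join(groups))


def lex(text, scanner):
    pos, tokens = 0, []
    while pos < len(text):
        match = scanner.match(text, pos)
        if match is None:
            raise ValueError(f"no terminal matches at position {pos}")
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


tokens = lex("+=", build_scanner(TERMINALS))
print(tokens)  # [('BINOP', '+'), ('ASSIGNMENT_OP', '=')]
```

In this model BINOP is tried first because of long_operator, its "+" branch matches, and the "=" is left over for ASSIGNMENT_OP, which is the same split the issue reports.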

P.S. The lark version is 0.7.0, installed from the python-lark-parser package on Arch Linux.

Issue Analytics

State:
Created 4 years ago
Comments: 11 (4 by maintainers)

Top GitHub Comments

1 reaction

kodo-pp commented, Apr 26, 2019

OK, now I see I was wrong. Thank you for your explanation.

0 reactions

erezsh commented, Apr 26, 2019

“Maybe because there are no regexps in my code?”

Incorrect. a | b | c is a regexp. Whenever you define a terminal that is not a plain string, it’s a regexp.

If you want to group different terminals together, that’s what rules are for.

If you think you can provide a better explanation, feel free to write one, and I might include it in the docs.
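The suggestion to group terminals with rules can be illustrated in a simplified model, again without lark and without claiming to mirror its internals. The sketch assumes that every string literal becomes its own terminal, that longer literals are tried first, and that the grouping into assignment_op and binop happens one level up, in a rule. RULES, literal_to_rule and lex_grouped are hypothetical names, and whitespace handling is left out.

```python
import re

# Hypothetical model: each literal is its own terminal; "rules" group them.
RULES = {
    "assignment_op": ["=", "+="],
    "binop": ["*", "/", "%", "+", "-", "long_operator"],
}

# Map every literal back to the rule that owns it.
literal_to_rule = {lit: rule for rule, lits in RULES.items() for lit in lits}

# Assumption: longer literals are tried first, so "+=" is tested before "+".
ordered = sorted(literal_to_rule, key=len, reverse=True)
scanner = re.compile("|".join(re.escape(lit) for lit in ordered))


def lex_grouped(text):
    pos, result = 0, []
    while pos < len(text):
        match = scanner.match(text, pos)
        if match is None:
            raise ValueError(f"no terminal matches at position {pos}")
        # The rule name is looked up after lexing, so grouping no longer
        # influences which literal the lexer picks.
        result.append((literal_to_rule[match.group()], match.group()))
        pos = match.end()
    return result


grouped = lex_grouped("+=")
print(grouped)  # [('assignment_op', '+=')]
```

Because each literal competes on its own length here, the presence of long_operator in the binop group no longer affects how "+=" is split.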
