Possible bug in 'standard' lexer: the longest token is matched incorrectly

When I use the standard lexer (the default for the lalr parser), it sometimes determines the longest match incorrectly.

Consider the following code:

import lark

grammar = r'''
start: BINOP | ASSIGNMENT_OP

ASSIGNMENT_OP: "="
    | "+="

BINOP: "*"
    | "/"
    | "%" 
    | "+"
    | "-"
    | "long_operator"

%import common.WS
%ignore WS
'''

parser = lark.Lark(grammar, parser='lalr')
code = '+='
print(list(parser.lex(code)))

It should print [Token(ASSIGNMENT_OP, '+=')], because the longest match is += and the documentation says that the longest match should be returned in this case. However, the code prints [Token(BINOP, '+'), Token(ASSIGNMENT_OP, '=')], which is incorrect.

I suppose the lexer might be comparing not the lengths of the strings that actually match (+ and += in this case) but the lengths of the longest possible option of each terminal (long_operator for BINOP and += for ASSIGNMENT_OP).
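
The hypothesis above can be illustrated with a small model built only on Python's re module. This is a sketch, not lark's actual implementation: it assumes the lexer orders terminals by the longest string each one could possibly match and then returns the first terminal that matches at the current position. The names TERMINALS, build_lexer_regex and lex are hypothetical.

```python
import re

# The two terminals from the grammar, each as a list of literal alternatives.
TERMINALS = {
    "ASSIGNMENT_OP": ["=", "+="],
    "BINOP": ["*", "/", "%", "+", "-", "long_operator"],
}


def build_lexer_regex(terminals):
    # Assumption: terminals are ordered by the longest string they COULD
    # match, not by what they actually match on the current input.
    ordered = sorted(terminals.items(), key=lambda kv: -max(len(a) for a in kv[1]))
    parts = []
    for name, alternatives in ordered:
        body = "|".join(re.escape(a) for a in alternatives)
        parts.append(f"(?P<{name}>{body})")
    # Python's re takes the first alternative that matches, left to right.
    return re.compile("|".join(parts))


def lex(text, terminals=TERMINALS):
    regex = build_lexer_regex(terminals)
    tokens, pos = [], 0
    while pos < len(text):
        match = regex.match(text, pos)
        if match is None:
            raise ValueError(f"no terminal matches at position {pos}")
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


print(lex("+="))  # [('BINOP', '+'), ('ASSIGNMENT_OP', '=')]

# Without "long_operator", ASSIGNMENT_OP has the longer maximum and is tried first.
without_long = dict(TERMINALS, BINOP=["*", "/", "%", "+", "-"])
print(lex("+=", without_long))  # [('ASSIGNMENT_OP', '+=')]
```

In this model the result for '+=' depends on whether BINOP contains a long alternative, which is the behaviour the hypothesis describes.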

P.S. The lark version is 0.7.0, installed as the python-lark-parser package on Arch Linux.

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 11 (4 by maintainers)

Top GitHub Comments

1 reaction
kodo-pp commented, Apr 26, 2019

OK, now I see I was wrong. Thank you for your explanation.

0 reactions
erezsh commented, Apr 26, 2019

"Maybe because there are no regexps in my code?"

Incorrect. a | b | c is a regexp. Whenever you define a terminal that is not a plain string, it’s a regexp.

If you want to group different terminals together, that’s what rules are for.

If you think you can provide a better explanation, feel free to write one, and I might include it in the docs.
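
The advice to group terminals with rules can be sketched in a simplified model. This is an illustration under stated assumptions, not lark's implementation: every operator literal is its own terminal, terminals are tried longest first, and the grouping into assignment_op and binop happens one level above the lexer, which is the job a rule does in a grammar. All names here (TERMINALS, RULES, lex, classify, and the terminal names) are hypothetical. In the grammar itself, this would correspond to writing the two operator groups as lowercase rules instead of uppercase terminals.

```python
import re

# One terminal per literal, so each terminal has exactly one possible length.
TERMINALS = {
    "PLUSEQUAL": "+=",
    "EQUAL": "=",
    "STAR": "*",
    "SLASH": "/",
    "PERCENT": "%",
    "PLUS": "+",
    "MINUS": "-",
    "LONG_OPERATOR": "long_operator",
}

# Grouping is done above the lexer, as a rule would do it.
RULES = {
    "assignment_op": {"EQUAL", "PLUSEQUAL"},
    "binop": {"STAR", "SLASH", "PERCENT", "PLUS", "MINUS", "LONG_OPERATOR"},
}


def lex(text):
    # Longest literal first, so "+=" is tried before "+".
    ordered = sorted(TERMINALS.items(), key=lambda kv: -len(kv[1]))
    regex = re.compile("|".join(f"(?P<{n}>{re.escape(v)})" for n, v in ordered))
    tokens, pos = [], 0
    while pos < len(text):
        match = regex.match(text, pos)
        if match is None:
            raise ValueError(f"no terminal matches at position {pos}")
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


def classify(token_name):
    # Find the rule (group) that a terminal belongs to.
    for rule, members in RULES.items():
        if token_name in members:
            return rule
    raise KeyError(token_name)


tokens = lex("+=")
print(tokens)                             # [('PLUSEQUAL', '+=')]
print([classify(n) for n, _ in tokens])   # ['assignment_op']
```

Because each terminal now matches a single fixed string, the longest possible match and the actual match always have the same length, so the ordering can no longer pick "+" ahead of "+=".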
