Difference in matching of recursive "First"/"Ordered Choice" compared to top-down parsers

Hello, I have re-implemented the Pika algorithm in Python (https://github.com/maxfischer2781/bootpeg) but stumbled over a case where the Pika algorithm seems to behave notably differently from a left-recursive Packrat parser. I'm not sure whether that’s more appropriate for the paper or the reference implementation, but it seems relevant for your work.

The context is left-recursive rules for left-associative binary operations. Consider, for example, Python’s +/- grammar (translated to PEG from the source grammar):

sum <- sum '+' term / sum '-' term / term

Notably, since this grammar is used for applying separate actions for + and - there are two almost identical choices of entire binary operation rules. For reference, the rule used in the paper/reference implementation internalises the choice to a single rule, avoiding the situation:

E0 <- (E0 ('+'/'-') E1) / E1;

The issue can be traced well with the above sum rule and an input such as this:

a + b - c

The important part is that the left-most operator term corresponds to the left-most (“lowest ClauseIdx”) grammar term.

The desired parse tree, as matched by a left-recursive Packrat parser, looks like this:

          (sum) '-' (term)
            /           \
    (sum) '+' (term)    'c'
      /         \
  (term)        'b'
    /
  'a'

Basically, the parser matches term='a' allowing it to match sum=(term='a') '+' term='b' and finally sum=(sum='a + b') '-' term='c'. While the “- sum” rule has higher ClauseIdx than the “+ sum” rule, it is a valid match because it contains the lower ClauseIdx rule.
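
To make the expected result reproducible, here is a minimal sketch of how a seed-growing left-recursive parser arrives at that tree. It is hand-specialised to the single `sum` rule and is not code from bootpeg or the Pika reference implementation; the function names and the tuple-based tree format are assumptions made for illustration, and `term` is simplified to a single lowercase letter.

```python
import re

TERM = re.compile(r"\s*([a-z])")


def parse_term(text, pos):
    """term <- [a-z], skipping leading whitespace (simplified for this sketch)."""
    m = TERM.match(text, pos)
    return (("term", m.group(1)), m.end()) if m else None


def parse_op(text, pos, op):
    """Match a literal operator, skipping leading whitespace."""
    m = re.compile(r"\s*" + re.escape(op)).match(text, pos)
    return m.end() if m else None


def parse_sum(text, pos=0):
    """sum <- sum '+' term / sum '-' term / term, via seed growing."""
    # Seed: the only alternative that can match without a prior `sum` is `term`.
    seed = parse_term(text, pos)
    if seed is None:
        return None
    tree, end = ("sum", seed[0]), seed[1]
    # Grow: retry the ordered choice with the previous result standing in
    # for the left-recursive `sum`, until no alternative extends the match.
    while True:
        for op in ("+", "-"):  # ordered choice: '+' is tried before '-'
            after_op = parse_op(text, end, op)
            if after_op is None:
                continue
            rhs = parse_term(text, after_op)
            if rhs is None:
                continue
            tree, end = ("sum", tree, op, rhs[0]), rhs[1]
            break
        else:
            return tree, end


tree, end = parse_sum("a + b - c")
print(tree)
```

Each growth step accepts whichever alternative extends the previous result, so the `'-'` alternative is accepted in the second round even though it comes later in the choice.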

The Pika algorithm seems incapable of matching like this: since a “- sum” match has a higher ClauseIdx, it does not replace the previous “+ sum” match. Match:isBetterThan only allows selecting by length or ClauseIdx, but not by nesting.

Without having completely worked out the logic or tested it yet, it seems to me there are two potential approaches:

Introduce a Generation in addition to the ClauseIdx. A match replaces another iff it has better Generation, same Generation and better ClauseIdx, or same Generation and same ClauseIdx and better length.
Introduce a sub-clause check. A match replaces another iff it is a parent match of the other, or has better ClauseIdx, or same ClauseIdx and better length.
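
As a rough illustration of the first approach, the replacement rule is a lexicographic comparison. This sketch assumes that a higher generation is "better" (one more round of growing the left recursion) and that a lower ClauseIdx is better; the `GenMatch` class and its field names are hypothetical and do not exist in either implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenMatch:
    generation: int  # assumption: incremented each time the recursion is grown
    clause_idx: int  # index of the matching alternative; lower is better
    length: int      # number of input characters spanned

    def rank(self):
        # Tuples compare lexicographically: generation first, then ClauseIdx
        # (negated so that lower wins), then length.
        return (self.generation, -self.clause_idx, self.length)

    def replaces(self, other):
        return self.rank() > other.rank()


term = GenMatch(generation=0, clause_idx=2, length=1)  # sum <- term          "a"
add = GenMatch(generation=1, clause_idx=0, length=3)   # sum <- sum '+' term  "a+b"
sub = GenMatch(generation=2, clause_idx=1, length=5)   # sum <- sum '-' term  "a+b-c"

print(add.replaces(term), sub.replaces(add), add.replaces(sub))
```

With these assumptions the `'-'` match replaces the `'+'` match it was built from, and the `'+'` match cannot replace it back.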

Since this only applies to left-recursive choices, it can probably be optimised when preparing a parser: There are “recursive choice”, “choice” and “plain” clauses; only recursive choices need to be tracked for nesting.
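
A possible shape for that preparation step, as a sketch only: the grammar is modelled as a plain dict from rule name to a list of alternatives, each alternative a non-empty list of symbols. The representation and the `classify_rules` helper are assumptions for illustration, and nullable prefixes are ignored.

```python
def classify_rules(grammar):
    """Label each rule as 'recursive choice', 'choice' or 'plain'."""

    def leftmost_reachable(start):
        # Collect every rule reachable by following only the first symbol
        # of each alternative (assumes no alternative is empty or nullable).
        seen, stack = set(), [start]
        while stack:
            rule = stack.pop()
            for alternative in grammar[rule]:
                head = alternative[0]
                if head in grammar and head not in seen:
                    seen.add(head)
                    stack.append(head)
        return seen

    kinds = {}
    for name, alternatives in grammar.items():
        if len(alternatives) < 2:
            kinds[name] = "plain"
        elif name in leftmost_reachable(name):
            kinds[name] = "recursive choice"
        else:
            kinds[name] = "choice"
    return kinds


grammar = {
    "sum": [["sum", "'+'", "term"], ["sum", "'-'", "term"], ["term"]],
    "term": [["NUM"], ["SYM"]],
    "atom": [["SYM"]],
}
print(classify_rules(grammar))
```

Only rules labelled "recursive choice" would then need the extra nesting bookkeeping.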

Issue Analytics

State:
Created 2 years ago
Comments:50 (35 by maintainers)

Top GitHub Comments

maxfischer2781 commented, May 10, 2022

@lukehutch just wanted to say Thanks for helping me (and our new visitor) along. I think you’ve cleared up my confusion – feel free to close the issue.

lukehutch commented, Jun 4, 2021

OK, I understand what’s going on here… this is probably what you already explained, but:

The function that decides whether a new match should replace the existing memo table entry for a given clause at a given input position is too simplistic. Currently it is (from the class Match):

    /**
     * Compare this {@link Match} to another {@link Match} of the same {@link Clause} type and start position.
     * 
     * @return true if this {@link Match} is a better match than the other {@link Match}.
     */
    public boolean isBetterThan(Match other) {
        if (other == this) {
            return false;
        }
        // An earlier subclause match in a First clause is better than a later subclause match
        // A longer match (i.e. a match that spans more characters in the input) is better than a shorter match
        return (memoKey.clause instanceof First // 
                && this.firstMatchingSubClauseIdx < other.firstMatchingSubClauseIdx) //
                || this.len > other.len;
    }

In this case, Sum '+' Term has firstMatchingSubClauseIdx = 0, and Sum '-' Term has firstMatchingSubClauseIdx = 1. So even though

└─Sum '-' Term : 2+5 : "a+b-c"
  ├─Sum <- add:(Sum '+' Term) / sub:(Sum '-' Term) / term:Term : 2+3 : "a+b"
  │ └─add:(Sum '+' Term) : 2+3 : "a+b"
  │   ├─Sum <- add:(Sum '+' Term) / sub:(Sum '-' Term) / term:Term : 2+1 : "a"
  │   │ └─Term <- term:(num:[0-9]+ / sym:[a-z]+) : 2+1 : "a"
  │   │   └─sym:[a-z]+ : 2+1 : "a"
  │   │     └─[a-z] : 2+1 : "a"
  │   ├─'+' : 3+1 : "+"
  │   └─Term <- num:[0-9]+ / sym:[a-z]+ : 4+1 : "b"
  │     └─sym:[a-z]+ : 4+1 : "b"
  │       └─[a-z] : 4+1 : "b"
  ├─'-' : 5+1 : "-"
  └─Term <- num:[0-9]+ / sym:[a-z]+ : 6+1 : "c"
    └─sym:[a-z]+ : 6+1 : "c"
      └─[a-z] : 6+1 : "c"

fully contains the subtree of

└─Sum '+' Term : 2+3 : "a+b"
  ├─Sum <- add:(Sum '+' Term) / sub:(Sum '-' Term) / term:Term : 2+1 : "a"
  │ └─Term <- term:(num:[0-9]+ / sym:[a-z]+) : 2+1 : "a"
  │   └─sym:[a-z]+ : 2+1 : "a"
  │     └─[a-z] : 2+1 : "a"
  ├─'+' : 3+1 : "+"
  └─Term <- num:[0-9]+ / sym:[a-z]+ : 4+1 : "b"
    └─sym:[a-z]+ : 4+1 : "b"
      └─[a-z] : 4+1 : "b"

once the parent clause (Sum '+' Term) / (Sum '-' Term) / Term is parsed, the shorter (Sum '+' Term) will always overwrite the longer and deeper tree (Sum '-' Term).
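
The quoted comparison can be modelled in a few lines of Python to show why the two matches keep displacing each other. This is a transliteration for illustration only: the `MemoMatch` class and its field names are stand-ins, and the `instanceof First` test is reduced to a boolean flag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoMatch:
    first_matching_sub_clause_idx: int
    length: int
    in_first_clause: bool = True  # stands in for `memoKey.clause instanceof First`


def is_better_than(this, other):
    """Python model of the quoted Java isBetterThan."""
    if other is this:
        return False
    return (
        this.in_first_clause
        and this.first_matching_sub_clause_idx < other.first_matching_sub_clause_idx
    ) or this.length > other.length


add = MemoMatch(first_matching_sub_clause_idx=0, length=3)  # Sum '+' Term : "a+b"
sub = MemoMatch(first_matching_sub_clause_idx=1, length=5)  # Sum '-' Term : "a+b-c"

print(is_better_than(sub, add))  # True: the longer match wins on length
print(is_better_than(add, sub))  # True: the earlier subclause wins on index
```

Both calls return `True`, so the relation is not antisymmetric: whichever of the two matches is offered last replaces the other.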

The solution here would be to add an earlier check in the thisMatch.isBetterThan(otherMatch) function that will return false if thisMatch is a sub-tree of otherMatch.

Searching otherMatch for the root node of thisMatch would make memoization take O(d^2) in the depth of the parse tree, however! So that’s not a great solution.
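
To give a feel for that cost, here is an illustrative count, under the assumption that every time the left recursion grows by one level the new tree is searched for an earlier match sitting at the bottom of the left spine. The `Node` class and the `find` and `total_visits` helpers are hypothetical and only model the shape of the problem.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity comparison: we search for one specific node
class Node:
    label: str
    children: list = field(default_factory=list)


def find(tree, target):
    """Pre-order search; return the number of nodes visited until target is found."""
    visited, stack = 0, [tree]
    while stack:
        node = stack.pop()
        visited += 1
        if node is target:
            return visited
        stack.extend(reversed(node.children))  # visit the leftmost child first
    return None


def total_visits(depth):
    """Grow a left-nested tree level by level, searching for the seed each time."""
    seed = Node("term")
    current, total = seed, 0
    for _ in range(depth):
        current = Node("sum", [current, Node("op"), Node("term")])
        total += find(current, seed)
    return total


for d in (4, 8, 16):
    print(d, total_visits(d))
```

In this model the search at level k walks k + 1 nodes, so the total over d levels is d(d + 3)/2, which grows quadratically in the depth.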

Another solution might be to add a depth field to Match instances, which starts at 0 for leaves, and adds 1 as each level is added to the match tree, bottom-up. But at a node with more than one child, the depth would be the max of all the child depths, plus 1. Then in isBetterThan, a deeper tree is determined to be a better match than a shallower tree. This should work, since we’re only talking about the depth from the leaf for a specific clause at a specific character position, so if the depth increases, then necessarily the parsing must have gone around an additional loop in the grammar.
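
A minimal sketch of the depth idea, assuming depth is compared first and the existing index and length rules only break ties between equally deep trees. That ordering is an assumption, as are the `DepthMatch` class and its field names.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class DepthMatch:
    idx: int             # firstMatchingSubClauseIdx in the quoted Java code
    length: int
    children: tuple = ()
    depth: int = field(init=False, default=0)

    def __post_init__(self):
        # Leaves have depth 0; inner nodes are one deeper than their deepest child.
        if self.children:
            self.depth = 1 + max(child.depth for child in self.children)


def is_better_than(this, other):
    if other is this:
        return False
    if this.depth != other.depth:
        return this.depth > other.depth  # assumption: a deeper tree always wins
    return this.idx < other.idx or this.length > other.length


def leaf():
    return DepthMatch(idx=0, length=1)


sum_a = DepthMatch(idx=2, length=1, children=(leaf(),))             # Sum <- Term      "a"
add = DepthMatch(idx=0, length=3, children=(sum_a, leaf(), leaf()))   # Sum '+' Term     "a+b"
sum_ab = DepthMatch(idx=0, length=3, children=(add,))               # Sum wrapping add
sub = DepthMatch(idx=1, length=5, children=(sum_ab, leaf(), leaf()))  # Sum '-' Term     "a+b-c"

print(add.depth, sub.depth)
print(is_better_than(sub, add), is_better_than(add, sub))
```

Here the `'-'` match is deeper than the `'+'` match it contains, so it wins in one direction only and the overwrite cycle disappears.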

At least I think that’s the correct solution. What do you think?
