Difference in matching of recursive "First"/"Ordered Choice" compared to top-down parsers
Hello, I have re-implemented the Pika algorithm in Python (https://github.com/maxfischer2781/bootpeg) but stumbled over a case where the Pika algorithm seems to behave notably differently from a left-recursive Packrat parser. I am not sure whether that’s more appropriate for the paper or the reference implementation, but it seems relevant for your work.
The context is left-recursive rules for left-associative binary operations. Consider, for example, Python’s +/- grammar (translated to PEG from the source grammar):
sum <- sum '+' term / sum '-' term / term
Notably, since this grammar is used for applying separate actions for + and -, there are two almost identical choices of entire binary operation rules. For reference, the rule used in the paper/reference implementation internalises the choice to a single rule, avoiding the situation:
E0 <- (E0 ('+'/'-') E1) / E1;
The issue can be traced well with the above sum rule and an input such as this:
a + b - c
The important part is that the left-most operator term corresponds to the left-most (“lowest ClauseIdx”) grammar term.
The desired parse tree, as matched by a left-recursive Packrat parser, looks like this:
                  (sum) '-' (term)
                  /            \
        (sum) '+' (term)       'c'
        /            \
    (term)           'b'
      /
    'a'
Basically, the parser matches term='a', allowing it to match sum=(term='a') '+' term='b' and finally sum=(sum='a + b') '-' term='c'. While the “- sum” rule has a higher ClauseIdx than the “+ sum” rule, it is a valid match because it contains the lower ClauseIdx rule.
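The matching order described above can be reproduced with a small seed-growing loop, which is one common way for a Packrat parser to handle direct left recursion. This is only a minimal sketch, not the code of any particular parser: the names `parse_sum`, `sum_body` and `grow` are hypothetical, the input is assumed to be pre-tokenised, and a term is assumed to be any single non-operator token.

```python
OPERATORS = ("+", "-")


def parse_sum(tokens):
    """Parse `sum <- sum '+' term / sum '-' term / term` starting at token 0."""
    memo = {}  # position -> (tree, end) for the best `sum` found so far

    def term(pos):
        # Assumption: a term is any single token that is not an operator.
        if pos < len(tokens) and tokens[pos] not in OPERATORS:
            return tokens[pos], pos + 1
        return None

    def sum_body(pos):
        # Ordered choice: try the '+' alternative, then '-', then a plain term.
        seed = memo.get(pos)  # the left-recursive `sum` call reads the memo
        if seed is not None:
            left, end = seed
            for op in OPERATORS:
                if end < len(tokens) and tokens[end] == op:
                    right = term(end + 1)
                    if right is not None:
                        return (left, op, right[0]), right[1]
        return term(pos)

    def grow(pos):
        memo[pos] = None  # initial seed: the recursive call fails
        while True:
            result = sum_body(pos)
            best = memo[pos]
            if result is None or (best is not None and result[1] <= best[1]):
                return best  # the match stopped growing
            memo[pos] = result

    return grow(0)


tree, end = parse_sum(["a", "+", "b", "-", "c"])
print(tree)  # (('a', '+', 'b'), '-', 'c')
```

Each pass through the loop reuses the previous, shorter `sum` as the left operand, so the '-' alternative ends up wrapping the earlier '+' match.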
The Pika algorithm seems incapable of matching like this: since a “- sum” match has a higher ClauseIdx, it does not replace the previous “+ sum” match. Match:isBetterThan only allows selecting by length or ClauseIdx, but not by nesting.
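To make the failure concrete, here is a toy model of the comparison as it is described in this issue. It is not the reference implementation's code: the `Match` class, its fields and the priority of ClauseIdx over length are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    clause_idx: int  # index of the matching alternative in the ordered choice
    length: int      # number of input characters covered

    def is_better_than(self, other: "Match") -> bool:
        # Assumed rule: a lower ClauseIdx wins, length only breaks ties.
        if self.clause_idx != other.clause_idx:
            return self.clause_idx < other.clause_idx
        return self.length > other.length


plus_match = Match(clause_idx=0, length=5)   # sum '+' term over "a + b"
minus_match = Match(clause_idx=1, length=9)  # sum '-' term over "a + b - c"

print(minus_match.is_better_than(plus_match))  # False
```

Under this rule the longer "a + b - c" match can never replace the "a + b" match, even though it contains it.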
Without having completely worked out the logic or tested it yet, it seems to me there are two potential approaches:
- Introduce a Generation in addition to the ClauseIdx. A match replaces another iff it has better Generation, same Generation and better ClauseIdx, or same Generation and same ClauseIdx and better length.
- Introduce a sub-clause check. A match replaces another iff it is a parent match of the other, or has better ClauseIdx, or same ClauseIdx and better length.
Since this only applies to left-recursive choices, it can probably be optimised when preparing a parser: There are “recursive choice”, “choice” and “plain” clauses; only recursive choices need to be tracked for nesting.
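The second approach could look roughly like the following sketch. Everything here is hypothetical: the `children` field, the `contains` helper, and the reading of “parent match” as “any ancestor” are assumptions, and the match tree is simplified so that the '+' match is a direct child of the '-' match. The mirror-image check (a match never beats one of its own ancestors) is added so that the two matches cannot both be better than each other.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Match:
    clause_idx: int
    length: int
    children: Tuple["Match", ...] = ()

    def contains(self, other: "Match") -> bool:
        # Walk the whole subtree looking for `other` (identity, not equality).
        return any(c is other or c.contains(other) for c in self.children)

    def is_better_than(self, other: "Match") -> bool:
        if self.contains(other):
            return True   # a match that wraps the other one wins
        if other.contains(self):
            return False  # assumption: never beat your own ancestor
        if self.clause_idx != other.clause_idx:
            return self.clause_idx < other.clause_idx
        return self.length > other.length


term_a, term_b, term_c = Match(2, 1), Match(2, 1), Match(2, 1)
plus_match = Match(0, 5, (term_a, term_b))       # a + b
minus_match = Match(1, 9, (plus_match, term_c))  # (a + b) - c

print(minus_match.is_better_than(plus_match))  # True
print(plus_match.is_better_than(minus_match))  # False
```

Note that `contains` searches the entire subtree on every comparison, which is the cost that restricting the check to recursive choices would try to limit.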
Top GitHub Comments
@lukehutch just wanted to say Thanks for helping me (and our new visitor) along. I think you’ve cleared up my confusions – feel free to close the issue.
OK, I understand what’s going on here… this is probably what you already explained, but:
The function that decides whether a memo table entry gets updated, i.e. whether a new match improves on the previous match for a given clause at a given input position, is too simplistic. It currently lives in the class Match.
In this case, Sum '+' Term has firstMatchingSubClauseIdx = 0, and Sum '-' Term has firstMatchingSubClauseIdx = 1. So even though the Sum '-' Term match fully contains the subtree of the Sum '+' Term match, once the parent clause (Sum '+' Term) / (Sum '-' Term) / Term is parsed, the shorter (Sum '+' Term) will always overwrite the longer and deeper tree (Sum '-' Term).
The solution here would be to add an earlier check in the thisMatch.isBetterThan(otherMatch) function that will return false if thisMatch is a sub-tree of otherMatch. Searching otherMatch for the root node of thisMatch would make memoization take O(d^2) in the depth of the parse tree, however! So that’s not a great solution.
Another solution might be to add a depth field to Match instances, which starts at 0 for leaves, and adds 1 as each level is added to the match tree, bottom-up. At a node with more than one child, the depth would be the max of all the child depths, plus 1. Then in isBetterThan, a deeper tree is determined to be a better match than a shallower tree. This should work, since we’re only talking about the depth from the leaf for a specific clause at a specific character position, so if the depth increases, then necessarily the parsing must have gone around an additional loop in the grammar.
At least I think that’s the correct solution. What do you think?
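The depth idea can be sketched as follows. This is an illustration under assumptions rather than the actual fix: the `Match` class and its fields are hypothetical, the match tree is simplified, and checking depth before ClauseIdx and length is one possible ordering that the comment above does not spell out.

```python
from dataclasses import dataclass, field
from typing import Tuple


@dataclass
class Match:
    clause_idx: int
    length: int
    children: Tuple["Match", ...] = ()
    depth: int = field(init=False, default=0)

    def __post_init__(self):
        # Leaves get depth 0; inner nodes get max(child depths) + 1.
        # Computed once at construction, so comparisons stay O(1).
        self.depth = 1 + max((c.depth for c in self.children), default=-1)

    def is_better_than(self, other: "Match") -> bool:
        if self.depth != other.depth:
            return self.depth > other.depth  # assumption: depth is checked first
        if self.clause_idx != other.clause_idx:
            return self.clause_idx < other.clause_idx
        return self.length > other.length


term_a, term_b, term_c = Match(2, 1), Match(2, 1), Match(2, 1)
plus_match = Match(0, 5, (term_a, term_b))       # a + b, depth 1
minus_match = Match(1, 9, (plus_match, term_c))  # (a + b) - c, depth 2

print(minus_match.depth)                       # 2
print(minus_match.is_better_than(plus_match))  # True
```

Because the depth is stored on each match when it is built bottom-up, the comparison needs no tree search, unlike the sub-tree check.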