Greediness of rules
Is it possible to adjust the greediness of rules with regex terminals?
My text to parse is this, for example:
OTHER STATEMENT
GO
CREATE TABLE mytable (
var1 float,
var2 float
)
GO
Here OTHER STATEMENT is something I don’t really care about, but it could have multiple tokens across lines, quotation marks, parentheses, etc. All we know is that it terminates with the word GO. What I actually want to capture is this CREATE TABLE statement. My attempt at a Lark grammar is this:
program: (statement "GO")* statement ["GO"]
statement: create_table_statement | other_statement
create_table_statement.1: "CREATE TABLE" table_name signature
table_name: CNAME
signature: "(" typed_variable ("," typed_variable)* ")"
typed_variable: variable type
variable: CNAME
type: CNAME
other_statement: /.+/s
%import common.CNAME
%import common.WS
%ignore WS
The problem though is that the whole text matches other_statement, because /.+/s consumes the GO keyword that is supposed to be the statement separator.
Is there a way within Lark to achieve what I want? I have a vague idea how to preprocess the text and remove the irrelevant statements before feeding it to Lark, but that may require some additional coding effort (making sure that “GO” appears not as a part of a quoted string).
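The preprocessing idea could look roughly like the sketch below. It is only an illustration under stated assumptions: strings are SQL-style single-quoted literals with '' as the escaped quote, GO is matched case-sensitively as a whole word, and the name split_on_go is made up for this example.

```python
import re

# Assumption: strings are single-quoted, with '' as an escaped quote.
# The pattern matches either a whole quoted string or a bare GO word,
# so a GO inside a string is swallowed by the first alternative.
_TOKEN = re.compile(r"'(?:[^']|'')*'|\bGO\b")


def split_on_go(text):
    """Split text at every GO that is not inside a quoted string."""
    statements = []
    start = 0
    for match in _TOKEN.finditer(text):
        if match.group() == "GO":  # a quoted string never equals bare GO
            statements.append(text[start:match.start()].strip())
            start = match.end()
    statements.append(text[start:].strip())
    # Drop empty pieces, e.g. the one after a trailing GO.
    return [s for s in statements if s]
```

Each returned piece can then be inspected on its own, for example keeping only the ones that start with CREATE TABLE before passing them to a parser.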
Top GitHub Comments
Oh, sorry, bad formatting. Something like
.+?(?=\sGO)
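For illustration, this is how that pattern behaves in Python's re module on the example text from the question. It is a sketch outside Lark; how a Lark terminal written this way interacts with the rest of the grammar is not shown here. The variable names are made up for the example.

```python
import re

# Example input from the question.
text = (
    "OTHER STATEMENT\n"
    "GO\n"
    "CREATE TABLE mytable (\n"
    "var1 float,\n"
    "var2 float\n"
    ")\n"
    "GO\n"
)

# Greedy: with DOTALL, ".+" runs to the end of the input, GO included.
greedy = re.match(r".+", text, re.DOTALL).group()

# Lazy plus lookahead: stops right before the first whitespace-then-GO.
lazy = re.match(r".+?(?=\sGO)", text, re.DOTALL).group()
```

Here greedy is the whole input, while lazy is just "OTHER STATEMENT". Note that the lookahead has no word boundary and knows nothing about quoting, so it would also stop before a whitespace-preceded GO inside a string or before a word such as GOTO.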
Absolutely, although it’s a less common use-case. We’ve been asked about it in the past, but I still haven’t figured out a way for Lark to give a good answer for this. It might be best to skip the unknown sections manually, and then only parse the structured parts. (You can use interactive_parse to get even more control over the parsing mechanism.)
?(?=\sGO) however is a zero-width regexp, I’m not sure how it could be used.
I understand that it’s best to have the complete grammar, but it’s also useful to be able to parse a document partially. And it seems to mostly work fine when the separator is included in the statement.
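One way to read "skip the unknown sections manually, and then only parse the structured parts" is sketched below. It assumes GO always sits alone on its own line, as in the example input, and that interesting chunks start with CREATE TABLE. The names structured_chunks and parse_structured are hypothetical, and parse stands for whatever callable does the real parsing (for instance the parse method of a Lark parser built with create_table_statement as its start rule).

```python
import re


def structured_chunks(text, keyword="CREATE TABLE"):
    """Yield only the GO-separated chunks that start with keyword."""
    # Assumption: the separator is a line containing nothing but GO.
    for chunk in re.split(r"^\s*GO\s*$", text, flags=re.MULTILINE):
        chunk = chunk.strip()
        if chunk.startswith(keyword):
            yield chunk


def parse_structured(text, parse):
    """Apply parse to each kept chunk and collect the results."""
    return [parse(chunk) for chunk in structured_chunks(text)]
```

With this split done up front, the grammar no longer needs a catch-all other_statement rule, since the unknown statements never reach the parser.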