Parse-time specification of start non-terminal
I’m working on a grammar for a query language with several entities that can recursively contain each other. As an example, imagine SQL and entities like an entire subquery, a projection item, or a constraint in the WHERE clause: any of those entities can contain the others. The parser needs to be able to parse a string given what kind of entity it is, i.e. I ultimately need one method that parses any valid select query, another that parses any valid projection item, another that parses a constraint, and so on. I think I’ll need at least 7 such entities.
I currently see two options on how to do that.
- Create multiple parsers, each time specifying the same grammar but a different start non-terminal. This has the disadvantage of compiling the grammar and creating the parser each time, which is a problem because a single parser takes a quarter of a second to build, and with 7 of them the startup time grows too much.
- Create a grammar with a special starting non-terminal that determines the type of entity that the parsed string should represent based on some short prefix, like here (just an illustration, I didn’t even check this is a valid grammar):
start: "!select" select
| "!projection" projection
| "!where" where
// below is what the actual grammar might look like
select: "select" projection+ "where" where
projection: "column" | select
where: "constraint" | "exists(" select ")"
The above is most likely what I’ll end up doing, but it’s ugly, and I will have to account for the artificially added prefix when determining the position of parse errors in user-supplied input.
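The position bookkeeping for the prefix workaround could look roughly like the sketch below. It is only an illustration: `PREFIXES`, `ParseError`, `parse_entity` and `fake_parse` are hypothetical names, not part of Lark, and the error is assumed to carry a 0-based character offset.

```python
# Hypothetical prefixes, one per entity, matching the illustrative grammar.
PREFIXES = {
    "select": "!select ",
    "projection": "!projection ",
    "where": "!where ",
}


class ParseError(Exception):
    """Stand-in for the parser's own error type (an assumption)."""

    def __init__(self, message, pos):
        super().__init__(message)
        self.pos = pos  # 0-based character offset into the parsed text


def parse_entity(raw_parse, entity, text):
    """Parse `text` as `entity` by prepending the marker prefix.

    `raw_parse` is any callable that parses a full string and raises
    ParseError on failure.
    """
    prefix = PREFIXES[entity]
    try:
        return raw_parse(prefix + text)
    except ParseError as err:
        # Shift the offset back so it refers to the user's input.
        user_pos = max(err.pos - len(prefix), 0)
        raise ParseError(str(err), user_pos) from None


def fake_parse(text):
    """Toy parser for the demo: rejects the first '?' it sees."""
    pos = text.find("?")
    if pos != -1:
        raise ParseError("unexpected '?'", pos)
    return text
```

If the parser reports line and column instead of an offset, only columns on the first line would need the same correction, since the prefix does not add a line break.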
A better solution I could imagine is the ability to supply the start non-terminal when calling the parse method. I don’t know much of the theory behind the other parser frontends, but for LALR this should be possible just by initializing the parser state stack with a state determined by mapping the name of the start non-terminal supplied by the caller.
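To make the requested API shape concrete, here is a minimal sketch of a single parser object whose entry rule is chosen per call. It uses a hand-written recursive-descent parser for the illustrative grammar above rather than an LALR state table, and every name in it (`EntityParser`, the tuple-based tree) is an assumption made for the example, not Lark code.

```python
import re


class EntityParser:
    """One parser instance, several possible start rules."""

    def __init__(self):
        # Map each start non-terminal name to the method that parses it.
        self._rules = {
            "select": self._select,
            "projection": self._projection,
            "where": self._where,
        }

    def parse(self, text, start="select"):
        if start not in self._rules:
            raise ValueError(f"unknown start rule {start!r}")
        # Tokens: "exists(", whole words, or any other single character.
        self._tokens = re.findall(r"exists\(|\w+|\S", text)
        self._pos = 0
        tree = self._rules[start]()
        if self._pos != len(self._tokens):
            raise SyntaxError(f"unexpected token {self._peek()!r}")
        return tree

    def _peek(self):
        if self._pos < len(self._tokens):
            return self._tokens[self._pos]
        return None

    def _expect(self, token):
        if self._peek() != token:
            raise SyntaxError(f"expected {token!r}, got {self._peek()!r}")
        self._pos += 1

    def _select(self):
        # select: "select" projection+ "where" where
        self._expect("select")
        items = [self._projection()]
        while self._peek() in ("column", "select"):
            items.append(self._projection())
        self._expect("where")
        return ("select", items, self._where())

    def _projection(self):
        # projection: "column" | select
        if self._peek() == "select":
            return ("projection", self._select())
        self._expect("column")
        return ("projection", "column")

    def _where(self):
        # where: "constraint" | "exists(" select ")"
        if self._peek() == "exists(":
            self._pos += 1
            inner = self._select()
            self._expect(")")
            return ("where", ("exists", inner))
        self._expect("constraint")
        return ("where", "constraint")


parser = EntityParser()  # built once
item = parser.parse("column", start="projection")
cond = parser.parse("exists(select column where constraint)", start="where")
```

The grammar-dependent setup happens once in the constructor, and `start` defaults to one rule so that callers who only ever parse whole queries are unaffected.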
This is probably too big of a change for me to attempt to do myself any time soon. Is this something that you would like lark to have one day? Do you see any better workaround, or is the input prefix hack the best one can do?
Issue Analytics
- Created 4 years ago
- Comments: 10 (5 by maintainers)
Top GitHub Comments
Thanks, @petee-d, I’m happy you had such a good experience with Lark! And I’m flattered to receive such compliments. This feature request just had the right balance of being small and simple enough, but also a little tricky, which is my favorite combination. Also, I think it’s an improvement on Lark’s API.
If you do end up talking about Lark, let me know if you have any questions for me. And if possible, please send me the link afterwards, so I can brag about it 😃
This feature also sounds useful for, e.g., testing that a transformer works correctly on certain sub-parts of the grammar. Currently, I create a separate Lark instance for each test, which, although relatively easy to implement with pytest fixtures, is quite wasteful and slow.
Maybe a better API would be
to ensure this feature remains optional and to not complicate parse calls for the “default” usage?