Embedded language parser with context-dependent "boundaries"
I hope the issue title isn’t too confusing. This is closely related to #803.
I’m trying to build a parser for a language like this:
FUNCTION x {
    SELECT * FROM t;
}
NAMESPACE y {
    FUNCTION z {
        SELECT * FROM t WHERE k = $1;
    }
}
In short: the “global” context is like a namespace; namespaces can contain other namespaces and functions, and functions contain raw SQL code that’s passed on to an external parser.
With the help of #803 I found out about multi-mode parsers, but I can’t figure out how to apply them here. The { character can open both function and namespace ‘blocks’, so I can’t differentiate there. I also don’t see an efficient way to change mode on the FUNCTION or NAMESPACE keywords because there’s an identifier token in the middle (from the lexer’s point of view, SQL could also be a bunch of identifiers).
I’ve tried more things than I can remember, but the only solution that I haven’t explicitly seen failing and that I can make sense of in my head goes like this:
- FUNCTION pushes the “function declaration mode”
- In this mode only identifier and “function block open/close” tokens are used
- This “function block opener” token pushes the “SQL mode”
- In “SQL mode” there’s a single custom pattern matcher that matches everything up to an (unescaped) } and pops the “SQL mode” again
- The lexer would now be in “function declaration mode” again, matching the “function block close” token (popping the “function declaration mode”)
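To make the steps above concrete, here is a minimal hand-rolled sketch of the mode stack in TypeScript. It deliberately uses no lexer library, so the `tokenize` function, the mode names and the token type names (`FunctionOpen`, `NamespaceOpen`, `RawSql`, ...) are all made up for illustration. It assumes a backslash is the escape character for `}` inside SQL and leaves unescaping to the caller.

```typescript
// Illustrative mode-stack lexer; all names here are hypothetical.
type Mode = "global" | "functionDecl" | "sql";

interface Token {
  type: string;
  image: string;
}

function tokenize(input: string): Token[] {
  const tokens: Token[] = [];
  const modes: Mode[] = ["global"];
  let pos = 0;

  while (pos < input.length) {
    const mode = modes[modes.length - 1];
    const rest = input.slice(pos);

    if (mode === "sql") {
      // Single matcher: everything up to the first unescaped '}'.
      // The '}' itself is left for "function declaration mode".
      const m = /^(?:\\[\s\S]|[^\\}])*/.exec(rest)!;
      tokens.push({ type: "RawSql", image: m[0].trim() });
      pos += m[0].length;
      modes.pop(); // back to "function declaration mode"
      continue;
    }

    const ws = /^\s+/.exec(rest);
    if (ws) {
      pos += ws[0].length;
      continue;
    }

    const word = /^[A-Za-z_]\w*/.exec(rest);
    if (word) {
      const image = word[0];
      if (mode === "global" && image === "FUNCTION") {
        tokens.push({ type: "Function", image });
        modes.push("functionDecl");
      } else if (mode === "global" && image === "NAMESPACE") {
        tokens.push({ type: "Namespace", image });
      } else {
        tokens.push({ type: "Identifier", image });
      }
      pos += image.length;
      continue;
    }

    if (rest[0] === "{") {
      if (mode === "functionDecl") {
        // Same pattern as the namespace opener, but this one pushes SQL mode.
        tokens.push({ type: "FunctionOpen", image: "{" });
        modes.push("sql");
      } else {
        tokens.push({ type: "NamespaceOpen", image: "{" });
      }
      pos += 1;
      continue;
    }

    if (rest[0] === "}") {
      if (mode === "functionDecl") {
        tokens.push({ type: "FunctionClose", image: "}" });
        modes.pop(); // back to the global/namespace mode
      } else {
        tokens.push({ type: "NamespaceClose", image: "}" });
      }
      pos += 1;
      continue;
    }

    throw new Error(`Unexpected character '${rest[0]}' at offset ${pos}`);
  }

  return tokens;
}
```

In this sketch the two “block opener” tokens do share a pattern, but only one of them is reachable in any given mode, so there is no ambiguity. Namespaces need no mode of their own because nested namespaces and functions are lexed the same way at every depth.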
Although this system makes sense in my head, I feel like it’s not actually going to work. It seems unnecessarily convoluted to me, especially the fact that I’d need 2 separate “block opener” tokens that have the same pattern (but different push-/pop-modes).
Am I just overthinking it and is this actually a valid solution? Now that I’ve written it out like this it doesn’t look so bad anymore, but I feel like I’m abusing the multi-mode functionality here. Is there a better solution that I should be aware of?
Top GitHub Comments
If that ever becomes a problem you could replace the regExp with your own custom logic that is capable of ignoring the contents of string literals inside the SQL block.
Yup, a ‘}’ inside a string literal would have to be escaped as well. This isn’t the “highest level of convenience” but it makes parsing (or lexing) a lot easier. Since curly braces are very rare in actual SQL code I’m willing to sacrifice that small convenience for a much simpler parsing approach.
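For comparison, the “custom logic” alternative mentioned above could look roughly like the sketch below. The function name `matchRawSql` and its `(text, startOffset)` signature are assumptions made for illustration, not any particular library's API. It only understands single-quoted SQL string literals (with `''` as an embedded quote) and ignores SQL comments and double-quoted identifiers, so treat it as a starting point under those assumptions.

```typescript
// Hypothetical matcher: returns the raw SQL starting at `startOffset`, up to
// (not including) the first '}' that is outside a single-quoted literal.
// Returns null when no such '}' exists (unterminated block or literal).
function matchRawSql(text: string, startOffset: number): string | null {
  let inString = false;
  for (let i = startOffset; i < text.length; i++) {
    const ch = text[i];
    if (ch === "'") {
      // A doubled quote ('') inside a literal toggles twice, so the
      // scanner ends up back inside the literal, as intended.
      inString = !inString;
    } else if (ch === "}" && !inString) {
      return text.slice(startOffset, i);
    }
  }
  return null;
}
```

With a matcher like this, a `}` inside a string literal would no longer need escaping, at the cost of teaching the lexer a small piece of SQL syntax.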
I’ll take a look at some of the issues you linked when I have some more time on my hands. They certainly look very interesting!