Embedded language parser with context-dependant "boundaries"

I hope the issue title isn’t too confusing. This is closely related to #803.

I’m trying to build a parser for a language like this:

FUNCTION x {
    SELECT * FROM t;
}

NAMESPACE y {
    FUNCTION z {
        SELECT * FROM t WHERE k = $1;
    }
}

In short: the “global” context acts like a namespace; namespaces can contain other namespaces and functions; and functions contain raw SQL code that’s passed on to an external parser.

With the help of #803 I found out about multi-mode parsers, but I can’t figure out how to apply them here. The { character can open both function and namespace ‘blocks’ so I can’t differentiate there. I also don’t see an efficient way to change mode on the FUNCTION or NAMESPACE keywords because there’s an identifier token in the middle (from the lexer’s point of view SQL could also be a bunch of identifiers). I’ve tried more things than I can remember, but the only solution that I haven’t explicitly seen failing and that I can make sense of in my head goes like this:

1. FUNCTION pushes the “function declaration mode”
2. In this mode only identifier and “function block open/close” tokens are used
3. This “function block opener” token pushes the “SQL mode”
4. In “SQL mode” there’s a single custom pattern matcher that matches everything up to an (unescaped) } and pops the “SQL mode” again
5. The lexer would now be in “function declaration mode” again, matching the “function block close” token (popping the “function declaration mode”)
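
To make the mode flow above concrete, here is a minimal, library-independent sketch of a mode-stack lexer in Python. It is not Chevrotain's API: the mode names, token names (`FnOpen`, `NsOpen`, and so on) and the `tokenize` helper are all hypothetical, and it assumes a backslash escapes the next character inside the SQL body.

```python
import re

WS = re.compile(r"\s+")
IDENT = r"[A-Za-z_]\w*"
# Assumption: inside SQL, a backslash escapes the following character,
# so the body runs up to the first unescaped '}'.
SQL_BODY = r"(?:\\[\s\S]|[^}\\])*"

# mode -> ordered list of (token name, pattern, action)
# action is None, ("push", mode) or ("pop",). All names are hypothetical.
MODES = {
    "namespace": [
        ("Function", r"FUNCTION\b", ("push", "function_decl")),
        ("Namespace", r"NAMESPACE\b", None),
        ("Identifier", IDENT, None),
        ("NsOpen", r"\{", None),
        ("NsClose", r"\}", None),
    ],
    "function_decl": [
        ("Identifier", IDENT, None),
        # Same pattern as NsOpen/NsClose, but with different push/pop behaviour.
        ("FnOpen", r"\{", ("push", "sql")),
        ("FnClose", r"\}", ("pop",)),
    ],
    "sql": [
        ("SqlBody", SQL_BODY, ("pop",)),
    ],
}


def tokenize(text):
    """Return a list of (token name, matched text) pairs."""
    stack = ["namespace"]
    pos = 0
    tokens = []
    while pos < len(text):
        mode = stack[-1]
        if mode != "sql":
            # Whitespace is only skipped outside the raw SQL body.
            ws = WS.match(text, pos)
            if ws:
                pos = ws.end()
                continue
        for name, pattern, action in MODES[mode]:
            m = re.compile(pattern).match(text, pos)
            if m:
                break
        else:
            raise SyntaxError(f"unexpected input at offset {pos} in mode {mode!r}")
        tokens.append((name, m.group()))
        pos = m.end()
        if action is not None:
            if action[0] == "push":
                stack.append(action[1])
            else:
                stack.pop()
    return tokens
```

In this sketch the two “block opener” tokens do share a pattern, but each lives in its own mode, so the lexer never has to choose between them. The raw text of each `SqlBody` token is what would be handed to the external SQL parser.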

Although this system makes sense in my head, I feel like it’s not actually going to work. It seems unnecessarily convoluted to me, especially the fact that I’d need 2 separate “block opener” tokens that have the same pattern (but different push-/pop-modes).

Am I just overthinking it and is this actually a valid solution? Now that I’ve written it out like this it doesn’t look so bad anymore, but I feel like I’m abusing the multi-mode functionality here. Is there a better solution that I should be aware of?

Top GitHub Comments

bd82 commented, Aug 30, 2019

Yup, a ‘}’ inside a string literal would have to be escaped as well. This isn’t the “highest level of convenience” but it makes parsing (or lexing) a lot easier. Since curly braces are very rare in actual SQL code I’m willing to sacrifice that small convenience for a much simpler parsing approach.

If that ever becomes a problem you could replace the regExp with your own custom logic that is capable of ignoring the contents of string literals inside the SQL block.
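
As a rough illustration of what such custom logic might look like, the following Python sketch scans for the end of the SQL block while skipping over string literals. The function name `find_block_end` is hypothetical, and the sketch assumes only single-quoted SQL literals with `''` as the escaped quote; double-quoted identifiers, comments and dialect-specific quoting are deliberately ignored.

```python
def find_block_end(text, start=0):
    """Return the index of the first '}' at or after `start` that is not
    inside a single-quoted string literal, or -1 if there is none."""
    in_string = False
    i = start
    while i < len(text):
        ch = text[i]
        if in_string:
            if ch == "'":
                if i + 1 < len(text) and text[i + 1] == "'":
                    i += 1  # '' is an escaped quote; stay inside the literal
                else:
                    in_string = False
        elif ch == "'":
            in_string = True
        elif ch == "}":
            return i
        i += 1
    return -1
```

With a scanner like this, a `}` inside a string literal no longer needs escaping, at the cost of the lexer having to know a little about SQL quoting rules.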

HoldYourWaffle commented, Aug 30, 2019

Doesn’t the /{(\}|[^}])*}/ regExp assume that a closing curly brace inside the block is always escaped? Is this always true? Could, for example, a curly brace appear unescaped inside a string literal?

Yup, a ‘}’ inside a string literal would have to be escaped as well. This isn’t the “highest level of convenience” but it makes parsing (or lexing) a lot easier. Since curly braces are very rare in actual SQL code I’m willing to sacrifice that small convenience for a much simpler parsing approach.
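
To illustrate the trade-off, here is a small Python sketch of the regExp approach. The pattern is an assumed reading of the one quoted above (a backslash followed by `}` counts as an escaped brace), and `sql_block` is a hypothetical helper, not part of any library.

```python
import re

# Assumed intent: '{', then any run of escaped braces (backslash + '}')
# or characters other than '}', then the closing '}'.
BLOCK = re.compile(r"\{((?:\\\}|[^}])*)\}")


def sql_block(text):
    """Return the raw body of the first {...} block, or None if there is none."""
    m = BLOCK.search(text)
    return m.group(1) if m else None


# An escaped brace inside a string literal stays part of the body.
escaped = sql_block(r"{ SELECT '\}' }")

# An unescaped brace inside a string literal ends the block early.
unescaped = sql_block("{ SELECT '}' }")
```

Here `escaped` holds the full body including the `\}`, while `unescaped` is cut off right after the opening quote of the literal, which is exactly the inconvenience being accepted.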

However, it is possible to contribute more substantial features even without a deep understanding of the library, as long as the flows of those features have clear entry/exit points and some separation from the inner complexity.

I’ll take a look at some of the issues you linked when I have some more time on my hands, it certainly looks very interesting!
