Discuss: Sequences
Is your request related to a specific problem you’re having?
I don’t have time to go back and cite all the examples, but see a lot of existing grammars and our lengthy discussions on sequencing in the LaTeX thread. The problem arises when you want to match a specific sequence of modes (each of which may be complex in and of itself). A made-up BNF-like example:
<list> ::= *special <term> <term> <opt-whitespace> <list>
But now imagine that instead of just <tag> these might be more HTML-like constructs, with attributes, strings, special characters, etc… so 4 modes start to appear:
- key tag (<list ...>)
- assignment (::=)
- modifiers (*special)
- then multiple rule tags
The solution you’d prefer / feature you’d like to see added…
I’m proposing adding a sequence to sit beside contains. While contains is orderless, sequence would be sequential. I’m simplifying the rules below so you can see the bigger picture, but each rule could have its own begin, end, contains, etc… For the first pass I might restrict some things like starts, endsParent, endsWithParent just to shrink the problem space… and then find out later if those things are truly needed… but otherwise these would be full modes in their own right and have the same output/processing behavior as other modes when they are active.
// rule
{
  begin: /<.*>/,
  sequence: [
    { match: /<.*>/ },
    { match: /::=/ },
    { match: /\*special/, optional: true },
    { match: /<.*>/, multiple: true }
  ],
  end: mode.MATCH_NOTHING_RE,
  illegal: /\S/,
}
- begin could be optional, and if so it could borrow the begin from the first match in the sequence. The first match would obviously then need to be mandatory (not optional).
- end by default would attempt to “immediately terminate” the sequence (as it does with contains), though it’s worth discussing if this is the correct default for sequences and how we might change this without being inconsistent
  - this behavior could be changed just by setting an end rule, or using MATCH_NOTHING_RE if there is truly no end rule
- if one wants to mandate a full sequence, illegal could be used (such as using illegal above to flag non-spaces). This would cause an illegal error to be thrown for incomplete sequences.
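To make the "borrow begin from the first item" idea concrete, here is a minimal sketch. normalizeSequenceMode is a hypothetical helper (nothing like it exists in highlight.js today), and the optional flag is the one proposed above. The item regexes use <[^>]*> rather than <.*> only so that a tag cannot swallow the rest of the line.

```javascript
// Hypothetical helper: fills in `begin` for a mode that uses the proposed
// `sequence` key. This is a sketch of the proposal, not highlight.js code.
function normalizeSequenceMode(mode) {
  if (!Array.isArray(mode.sequence) || mode.sequence.length === 0) {
    throw new Error("sequence must be a non-empty array");
  }
  // An explicit begin always wins.
  if (mode.begin !== undefined) return mode;

  const first = mode.sequence[0];
  // Without an explicit begin, the first item has to be mandatory,
  // otherwise there is nothing reliable to enter the mode on.
  if (first.optional) {
    throw new Error("first sequence item must be mandatory when begin is omitted");
  }
  return { ...mode, begin: first.match };
}

const normalized = normalizeSequenceMode({
  sequence: [
    { match: /<[^>]*>/ },
    { match: /::=/ },
    { match: /\*special/, optional: true },
    { match: /<[^>]*>/, multiple: true }
  ],
  illegal: /\S/
});

console.log(normalized.begin.source);
```

Whether the borrowed begin should share the first item's match or consume it separately is left open here; the sketch simply copies the regex.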
worth discussing if this is the correct default for sequences
It feels like specifying end above is a bit annoying… but if we did not, then any space we encountered between the tags we care about would cause the sequence to terminate. This type of scenario (spaces or non-content in between the things we care about) is common enough that we should try to find a nice way to handle it without needing tons of additional modes for whitespace. Having a separate regex for whitespace is already really bugging me with our new multi-match support.
How would it work in practice
Once a mode with sequence was entered, the parser would go into a sequential mode loop:
- loop
- look for current item in sequence (or end or illegal)
- if illegal found, raise; if end found, end mode
- when item found, start that mode
- if item singular, increment position in sequence
- when item not found and optional (or multiple and already matched once), increment position
- if no more items, end mode
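The loop above could be sketched roughly as follows. This is a toy over plain regex items, not the real parser: runSequence and findFrom are hypothetical names, items are assumed to be non-empty regex matches rather than full modes, a missing end is treated as "never matches", and optional items are skipped before illegal is raised (a small reordering of the steps above so that a skipped optional does not trip illegal).

```javascript
// Find the first non-empty match of `re` in `text` at or after `pos`.
function findFrom(re, text, pos) {
  if (!re) return null;
  const scanner = new RegExp(re.source, re.flags.replace(/[gy]/g, "") + "g");
  scanner.lastIndex = pos;
  const m = scanner.exec(text);
  return m && m[0].length > 0 ? m : null;
}

// Hypothetical runner for a rule using the proposed `sequence` key.
function runSequence(rule, text) {
  const matched = [];
  let pos = 0;          // current offset into the text
  let i = 0;            // current position in the sequence
  let seenCurrent = false; // has a `multiple` item matched at least once?

  while (i < rule.sequence.length) {
    const item = rule.sequence[i];
    const hit = findFrom(item.match, text, pos);
    const end = findFrom(rule.end, text, pos);
    const illegal = findFrom(rule.illegal, text, pos);

    // The item wins ties, mirroring "contains before end before illegal".
    const itemFirst = hit !== null &&
      (end === null || hit.index <= end.index) &&
      (illegal === null || hit.index <= illegal.index);

    if (itemFirst) {
      matched.push(hit[0]);
      pos = hit.index + hit[0].length;
      if (item.multiple) seenCurrent = true; // stay on this item
      else i += 1;                           // singular: move on
      continue;
    }
    // Item not found next: skip it if it is allowed to be absent.
    if (item.optional || (item.multiple && seenCurrent)) {
      i += 1;
      seenCurrent = false;
      continue;
    }
    if (end !== null && (illegal === null || end.index <= illegal.index)) {
      return { status: "ended", matched };
    }
    if (illegal !== null) {
      throw new Error("illegal " + JSON.stringify(illegal[0]) + " at " + illegal.index);
    }
    return { status: "incomplete", matched }; // ran out of input
  }
  return { status: "complete", matched };
}

const bnfRule = {
  sequence: [
    { match: /<[^>]*>/ },
    { match: /::=/ },
    { match: /\*special/, optional: true },
    { match: /<[^>]*>/, multiple: true }
  ],
  illegal: /\S/
};

console.log(runSequence(bnfRule, "<list> ::= *special <term> <term>"));
console.log(runSequence(bnfRule, "<list> ::= <term>"));
```

With illegal set to /\S/, an input such as "<list> oops" throws, which is the "full sequence is required" behavior described above.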
This is oversimplified, of course… if you had 2 optionals, for example, the parser would be using a multi-regex that was scanning for the next 3 items in the sequence (plus begin and illegal)… since the 3rd rule would be eligible to match at any position due to the first 2 rules being optional.
This adds complexity, but I’m not sure a simpler design (one item after another, no repeats, all mandatory) solves a lot of real problems… I remember LaTeX definitely had optionals and such things.
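A rough sketch of how the set of items to scan for could be computed at each step. eligibleIndexes is a hypothetical helper, and optional / multiple are the flags proposed above; building the actual multi-regex from these indexes is left out.

```javascript
// Hypothetical helper: which sequence items may legally match next?
// Walk forward from the current position and stop at the first item
// that cannot be skipped.
function eligibleIndexes(sequence, position, matchedCurrent = false) {
  const indexes = [];
  for (let k = position; k < sequence.length; k++) {
    indexes.push(k);
    const item = sequence[k];
    // An item can be skipped if it is optional, or if it is the current
    // `multiple` item and has already matched at least once.
    const skippable = item.optional ||
      (k === position && item.multiple && matchedCurrent);
    if (!skippable) break;
  }
  return indexes;
}

const twoOptionals = [
  { match: /a/, optional: true },
  { match: /b/, optional: true },
  { match: /c/ },
  { match: /d/ }
];

// Two optionals in front of a mandatory item: indexes 0, 1 and 2 are eligible.
console.log(eligibleIndexes(twoOptionals, 0));
```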
Any alternative solutions you considered…
More sugar on top of our existing starts stuff, but I find it all quite convoluted to think and reason about… and there is very hard-to-understand behavior with endsParent and starts. Since changing starts would likely break a bunch of existing grammars, I think we need some real new behavior rather than just sugar.
Sugar also becomes incredibly hard to debug since the end user sees only the sugar and doesn’t understand the potentially incredibly complex rules being generated behind the scenes. So far I’ve tried to keep our sugar minimal and doing simple things.
There are other ways of writing it syntactically, such as reusing contains but having a flag to say it’s a sequence. I think I like this less off the top of my head though.
contains: [ ... ],
sequential: true,
Just to name one example.
Additional context…
Note that nowhere are we talking about branching or backtracking. This is not being discussed. Sequences either complete (after matching every item), terminate early with an incomplete sequence (the end matcher is triggered), or raise an error (they hit an illegal). Illegal is how you would specify that “the full sequence is required”.
Once we find the start of a sequence, we are committed to that sequence. This is not tackling problems like “It might be sequence X or sequence Y” - which would require backtracking. For some grammars these situations might be handled with creative use of optionals and multiples.
Also, in simple cases a strong begin regex with a look-ahead could help make sure the right sequence was selected.
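As an illustration of a "strong begin", a look-ahead can require that the tag is actually followed by the assignment before the sequence is entered. The regex below is only an example for the made-up BNF above, not something taken from an existing grammar.

```javascript
// Only treat a tag as the start of the sequence if `::=` follows it
// (optionally after whitespace). The look-ahead does not consume `::=`.
const strongBegin = /<[^>]*>(?=\s*::=)/;

console.log(strongBegin.test("<list> ::= <term>"));   // tag followed by ::=
console.log(strongBegin.test('<div class="x">'));     // plain tag, no ::=

// In mixed input the match lands on the tag that starts the rule,
// not on an earlier unrelated tag.
console.log(strongBegin.exec("<term> <list> ::= <term>").index);
```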
Issue Analytics
- State:
- Created 2 years ago
- Comments:12 (12 by maintainers)
It’s not just regex (they just make the examples easier to read), these can be full modes with 100s of submodes, etc… If it was just regex then someone would use the new multi-match support we just added… this is needed for much more complex rulesets.
All optional arguments in LaTeX work like this, for example
Here [short title] is optional, and the fact that \section is not directly (excepting most whitespace) followed by [ signals that the optional argument is not present. In particular, matching [containing brackets] as belonging to \section would be very wrong.
I agree.
Yes. At least from my standpoint (i.e. what do I need for LaTeX), that’s exactly what it should mean (the cb would simply be out of place there).