Simpler tokenization
Currently, the need to write a tokenizer makes it hard to get started with Superpower. Almost prohibitively so, except for the most motivated newcomers 😃
Generating high-performance, general tokenizers is a bit more than we can bite off here in the short term, but that doesn’t preclude us from making the experience better. At the expense of some raw performance, it’s possible to generate fairly useful tokenizers using TextParser<T>s as recognizers.
In v2, this is the model I’d like to propose:
    var tokenizer = new TokenizerBuilder<SExpressionToken>()
        .Ignore(Span.WhiteSpace)
        .Match(Character.EqualTo('('), SExpressionToken.LParen)
        .Match(Character.EqualTo(')'), SExpressionToken.RParen)
        .Match(Numerics.Integer, SExpressionToken.Number, requireDelimiters: true)
        .Match(Character.Letter.IgnoreThen(Character.LetterOrDigit.AtLeastOnce()),
            SExpressionToken.Atom, requireDelimiters: true)
        .Ignore(Comment.ShellStyle)
        .Build();

    var tokens = tokenizer.TryTokenize("abc (123 def) # this is a comment");
    Assert.True(tokens.HasValue);
    Assert.Equal(5, tokens.Value.Count());
Compare this with the by-hand version at https://github.com/datalust/superpower/blob/dev/test/Superpower.Tests/SExpressionScenario/SExpressionTokenizer.cs#L10 - at least 30 significant lines of fairly dense code, and that is without handling comments in the input.
The proposed TokenizerBuilder can accept any text parser as a recognizer and, using the requireDelimiters argument, can deal with the awkward "is null" vs "isnull" case using a one-token lookahead.
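To make the "is null" vs "isnull" case concrete, here is a minimal sketch, assuming the builder ships with the API proposed above. The QueryToken enum is invented for this example, and Span.EqualTo and Identifier.CStyle are parsers from Superpower.Parsers that the proposal itself does not mention.

```csharp
using System;
using System.Linq;
using Superpower.Parsers;
using Superpower.Tokenizers;

// Hypothetical token kinds, invented for this example.
enum QueryToken { Is, Null, Identifier }

static class Program
{
    static void Main()
    {
        var tokenizer = new TokenizerBuilder<QueryToken>()
            .Ignore(Span.WhiteSpace)
            // Keywords are only accepted when delimited, so "is" should not
            // be split off the front of "isnull".
            .Match(Span.EqualTo("is"), QueryToken.Is, requireDelimiters: true)
            .Match(Span.EqualTo("null"), QueryToken.Null, requireDelimiters: true)
            .Match(Identifier.CStyle, QueryToken.Identifier, requireDelimiters: true)
            .Build();

        foreach (var input in new[] { "is null", "isnull" })
        {
            var result = tokenizer.TryTokenize(input);
            // Print the token kinds on success, or the error text on failure.
            var kinds = result.HasValue
                ? string.Join(", ", result.Value.Select(t => t.Kind))
                : result.ToString();
            Console.WriteLine($"{input} => {kinds}");
        }
    }
}
```

If the delimiter check behaves as described, the first input should produce the two keyword tokens and the second a single identifier, because the keyword recognizers come before the identifier recognizer but are rejected when not followed by a delimiter.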
The downside is that tokenizer run-time increases linearly with respect to the number of matches attempted. This probably isn’t noticeable for small grammars like the one above, but larger grammars can do much better with a hand-written replacement. We might be able to extend the builder with some optimizations to claw this perf back, down the line, or add a table-based alternative.
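The linear cost is easier to see in a stripped-down model. The sketch below is a hypothetical illustration, not Superpower's implementation: every name in it (LinearTokenizerSketch, Recognizer, SExpressionRules) is invented here, and it leaves out requireDelimiters and comment handling. At each position it tries the recognizers in registration order until one matches, so the work per token grows with the number of rules.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration only; this is not Superpower's implementation.
static class LinearTokenizerSketch
{
    // Returns the number of characters matched at `pos`, or 0 for no match.
    public delegate int Recognizer(string input, int pos);

    public static List<(string Kind, string Text)> Tokenize(
        string input, IReadOnlyList<(string Kind, bool Ignore, Recognizer Match)> rules)
    {
        var tokens = new List<(string Kind, string Text)>();
        var pos = 0;
        while (pos < input.Length)
        {
            var matched = false;
            // Rules are tried in order; in the worst case every rule runs
            // before one succeeds, hence the linear cost per token.
            foreach (var (kind, ignore, match) in rules)
            {
                var length = match(input, pos);
                if (length <= 0) continue;
                if (!ignore) tokens.Add((kind, input.Substring(pos, length)));
                pos += length;
                matched = true;
                break;
            }
            if (!matched)
                throw new FormatException($"Unexpected character at position {pos}.");
        }
        return tokens;
    }

    // Length of the run of characters satisfying `pred`, starting at `pos`.
    static int Run(string s, int pos, Func<char, bool> pred)
    {
        var end = pos;
        while (end < s.Length && pred(s[end])) end++;
        return end - pos;
    }

    // A rough stand-in for the S-expression rules in the proposal.
    public static List<(string Kind, bool Ignore, Recognizer Match)> SExpressionRules() =>
        new List<(string Kind, bool Ignore, Recognizer Match)>
        {
            ("WhiteSpace", true, (s, p) => Run(s, p, char.IsWhiteSpace)),
            ("LParen", false, (s, p) => s[p] == '(' ? 1 : 0),
            ("RParen", false, (s, p) => s[p] == ')' ? 1 : 0),
            ("Number", false, (s, p) => Run(s, p, char.IsDigit)),
            ("Atom", false, (s, p) => Run(s, p, char.IsLetterOrDigit)),
        };

    public static void Main()
    {
        foreach (var (kind, text) in Tokenize("abc (123 def)", SExpressionRules()))
            Console.WriteLine($"{kind}: {text}");
    }
}
```

A hand-written or table-based tokenizer can instead dispatch on the first character and skip most of these attempts, which is where the performance gap on larger grammars would come from.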
Top GitHub Comments
@nblumhardt TokenizerBuilder is an amazing feature 👍 Yesterday I refactored the tokenizer of a custom DSL from 200 lines of code to 30 easily readable lines of TokenizerBuilder code in less than 40 minutes, and just 2 of ~200 tests are failing. It is really easy to get started!
@Platzer thanks for the link - will try to check it out 😃
@SuperJMN removing whitespace definitely does make constructing the parser a bit simpler 👍