Simpler tokenization

Currently, the need to write a tokenizer makes it hard to get started with Superpower. Almost prohibitively so, except for the most motivated newcomers 😃

Generating high-performance, general tokenizers is a bit more than we can bite off here in the short term, but that doesn’t preclude us from making the experience better. At the expense of some raw performance, it’s possible to generate fairly useful tokenizers using TextParser<T>s as recognizers.

In v2, this is the model I’d like to propose:

var tokenizer = new TokenizerBuilder<SExpressionToken>()
    .Ignore(Span.WhiteSpace)
    .Match(Character.EqualTo('('), SExpressionToken.LParen)
    .Match(Character.EqualTo(')'), SExpressionToken.RParen)
    .Match(Numerics.Integer, SExpressionToken.Number, requireDelimiters: true)
    .Match(Character.Letter.IgnoreThen(Character.LetterOrDigit.AtLeastOnce()),
        SExpressionToken.Atom, requireDelimiters: true)
    .Ignore(Comment.ShellStyle)
    .Build();

var tokens = tokenizer.TryTokenize("abc (123 def) # this is a comment");
Assert.True(tokens.HasValue);
Assert.Equal(5, tokens.Value.Count());

Compare this with the by-hand version in: https://github.com/datalust/superpower/blob/dev/test/Superpower.Tests/SExpressionScenario/SExpressionTokenizer.cs#L10 - at least 30 significant lines of fairly dense code, without even supporting comments.

The proposed TokenizerBuilder can accept any text parser as a recognizer, and using the requireDelimiters argument, can deal with the awkward "is null" vs "isnull" case using a one-token lookahead.
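To make the requireDelimiters idea concrete, here is a minimal, self-contained sketch of how such a check could work. It does not use Superpower and is not its implementation: `SketchTokenizer`, `Recognizer`, `Recognizers`, and `Tok` are hypothetical names invented for this illustration, and the delimiter rule used here (a delimited token must be followed by end of input or by something a non-delimited rule matches) is an assumption about what the one-token lookahead might look like.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical token kinds for the "is null" vs "isnull" case.
public enum Tok { Is, Null, Identifier }

// A recognizer returns the number of characters matched at `pos`, or 0 for no match.
public delegate int Recognizer(string input, int pos);

public static class Recognizers
{
    // Matches an exact, non-empty literal.
    public static Recognizer Literal(string text) =>
        (s, i) => string.CompareOrdinal(s, i, text, 0, text.Length) == 0 ? text.Length : 0;

    // Matches the longest run of characters satisfying `predicate`.
    public static Recognizer While(Func<char, bool> predicate) => (s, i) =>
    {
        var j = i;
        while (j < s.Length && predicate(s[j])) j++;
        return j - i;
    };
}

public sealed class SketchTokenizer
{
    // Kind == null marks an ignored rule (whitespace, comments).
    private readonly List<(Recognizer Recognize, Tok? Kind, bool RequireDelimiters)> _rules = new();

    public SketchTokenizer Ignore(Recognizer recognize)
    {
        _rules.Add((recognize, null, false));
        return this;
    }

    public SketchTokenizer Match(Recognizer recognize, Tok kind, bool requireDelimiters = false)
    {
        _rules.Add((recognize, kind, requireDelimiters));
        return this;
    }

    public List<(Tok Kind, string Text)> Tokenize(string input)
    {
        var result = new List<(Tok Kind, string Text)>();
        var pos = 0;
        while (pos < input.Length)
        {
            var matched = false;
            foreach (var rule in _rules) // rules are tried in registration order
            {
                var len = rule.Recognize(input, pos);
                if (len <= 0) continue;
                // Reject a delimited token that runs straight into more token text.
                if (rule.RequireDelimiters && !IsDelimitedAt(input, pos + len)) continue;
                if (rule.Kind is Tok kind) result.Add((kind, input.Substring(pos, len)));
                pos += len;
                matched = true;
                break;
            }
            if (!matched) throw new FormatException($"Unexpected input at position {pos}.");
        }
        return result;
    }

    // Assumed lookahead: end of input, or a rule that needs no delimiters matches here.
    private bool IsDelimitedAt(string input, int pos) =>
        pos >= input.Length ||
        _rules.Any(r => !r.RequireDelimiters && r.Recognize(input, pos) > 0);
}

public static class Demo
{
    public static SketchTokenizer Build() => new SketchTokenizer()
        .Ignore(Recognizers.While(char.IsWhiteSpace))
        .Match(Recognizers.Literal("is"), Tok.Is, requireDelimiters: true)
        .Match(Recognizers.Literal("null"), Tok.Null, requireDelimiters: true)
        .Match(Recognizers.While(char.IsLetter), Tok.Identifier, requireDelimiters: true);

    public static void Main()
    {
        var tokenizer = Build();
        Console.WriteLine(string.Join(", ", tokenizer.Tokenize("is null").Select(t => t.Kind)));
        Console.WriteLine(string.Join(", ", tokenizer.Tokenize("isnull").Select(t => t.Kind)));
    }
}
```

In this sketch "is null" yields the keyword tokens Is and Null, while "isnull" is rejected by both keyword rules (the text after "is" is not a delimiter) and falls through to a single Identifier token.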

The downside is that tokenizer run-time increases linearly with respect to the number of matches attempted. This probably isn’t noticeable for small grammars like the one above, but larger grammars can do much better with a hand-written replacement. We might be able to extend the builder with some optimizations to claw this perf back, down the line, or add a table-based alternative.
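As a rough illustration of one optimization that might claw some of this back, the sketch below indexes rules by the characters they can start with, so only plausible rules are attempted at each position. This is purely hypothetical and not part of Superpower or the proposal: `RuleIndex` and its members are invented names, and it assumes each rule can declare its possible first characters (rules that cannot are always tried).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: maps a lookahead character to the rules worth attempting.
public sealed class RuleIndex
{
    // One entry per rule; null means "may start with any character".
    private readonly List<char[]> _firstChars = new();
    private readonly Dictionary<char, int[]> _cache = new();

    // Registers a rule and returns its index. Pass no characters if the
    // rule's possible first characters are unknown.
    public int Add(params char[] firstChars)
    {
        _firstChars.Add(firstChars.Length == 0 ? null : firstChars);
        _cache.Clear(); // cached candidate lists are stale once a rule is added
        return _firstChars.Count - 1;
    }

    // Returns the indices of rules that could match at a position starting
    // with `c`, in registration order so rule priority is preserved.
    public int[] CandidatesFor(char c)
    {
        if (!_cache.TryGetValue(c, out var candidates))
        {
            candidates = Enumerable.Range(0, _firstChars.Count)
                .Where(i => _firstChars[i] == null || Array.IndexOf(_firstChars[i], c) >= 0)
                .ToArray();
            _cache[c] = candidates;
        }
        return candidates;
    }
}

public static class IndexDemo
{
    public static RuleIndex Build()
    {
        var index = new RuleIndex();
        index.Add('(');                   // rule 0: left paren
        index.Add(')');                   // rule 1: right paren
        index.Add("0123456789".ToCharArray()); // rule 2: number
        index.Add();                      // rule 3: atom, first characters not declared
        index.Add('#');                   // rule 4: shell-style comment
        return index;
    }

    public static void Main()
    {
        var index = Build();
        Console.WriteLine(string.Join(", ", index.CandidatesFor('(')));
        Console.WriteLine(string.Join(", ", index.CandidatesFor('7')));
    }
}
```

With the five rules registered in `Build`, a position starting with '(' only needs rules 0 and 3 attempted rather than all five. The saving depends on how many rules can declare their first characters.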

Issue Analytics

State:
Created 6 years ago
Reactions: 4
Comments: 18 (5 by maintainers)

Top GitHub Comments

1 reaction

Platzer commented, Mar 7, 2018

@nblumhardt TokenizerBuilder is an amazing feature 👍 Yesterday I refactored the tokenizer of a custom DSL from 200 lines of code to an easily readable 30 lines of TokenizerBuilder code in less than 40 minutes, and just 2 of ~200 tests are failing. It is really easy to get started!

1 reaction

nblumhardt commented, Mar 7, 2018

@Platzer thanks for the link - will try to check it out 😃

@SuperJMN removing whitespace definitely does make constructing the parser a bit simpler 👍
