Simpler tokenization

Currently, the need to write a tokenizer makes it hard to get started with Superpower. Almost prohibitively so, except for the most motivated newcomers 😃

Generating high-performance, general-purpose tokenizers is more than we can bite off here in the short term, but that doesn’t preclude us from making the experience better. At the expense of some raw performance, it’s possible to generate fairly useful tokenizers using TextParser<T> instances as recognizers.

In v2, this is the model I’d like to propose:

var tokenizer = new TokenizerBuilder<SExpressionToken>()
    .Ignore(Span.WhiteSpace)
    .Match(Character.EqualTo('('), SExpressionToken.LParen)
    .Match(Character.EqualTo(')'), SExpressionToken.RParen)
    .Match(Numerics.Integer, SExpressionToken.Number, requireDelimiters: true)
    .Match(Character.Letter.IgnoreThen(Character.LetterOrDigit.AtLeastOnce()),
        SExpressionToken.Atom, requireDelimiters: true)
    .Ignore(Comment.ShellStyle)
    .Build();

var tokens = tokenizer.TryTokenize("abc (123 def) # this is a comment");
Assert.True(tokens.HasValue);
Assert.Equal(5, tokens.Value.Count());

Compare this with the by-hand version at https://github.com/datalust/superpower/blob/dev/test/Superpower.Tests/SExpressionScenario/SExpressionTokenizer.cs#L10, which takes at least 30 significant lines of fairly dense code and doesn’t even support comments.

The proposed TokenizerBuilder can accept any text parser as a recognizer, and using the requireDelimiters argument, can deal with the awkward "is null" vs "isnull" case using a one-token lookahead.
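To make the "is null" vs "isnull" problem concrete, here is a minimal, dependency-free sketch of the idea behind a delimiter requirement. It is not Superpower's implementation: the `Rule` record, the `IsDelimiter` helper, and the string token kinds are all hypothetical, and the sketch looks ahead by one character, whereas the proposal describes a one-token lookahead.

```csharp
using System;
using System.Collections.Generic;

public static class DelimiterSketch
{
    // Hypothetical rule shape: a literal to match, the token kind it yields,
    // and whether the match must be followed by a delimiter or end of input.
    public record Rule(string Literal, string Kind, bool RequireDelimiters);

    // Assumption for this sketch: anything that is not a letter or digit delimits a token.
    static bool IsDelimiter(char c) => !char.IsLetterOrDigit(c);

    public static List<string> Tokenize(string input, IReadOnlyList<Rule> rules)
    {
        var kinds = new List<string>();
        var i = 0;
        while (i < input.Length)
        {
            // Whitespace is skipped rather than emitted as a token.
            if (char.IsWhiteSpace(input[i])) { i++; continue; }

            var matched = false;
            foreach (var rule in rules)
            {
                // Does the literal occur at position i?
                if (string.CompareOrdinal(input, i, rule.Literal, 0, rule.Literal.Length) != 0)
                    continue;

                var end = i + rule.Literal.Length;

                // Reject the match if it runs straight into more identifier characters.
                if (rule.RequireDelimiters && end < input.Length && !IsDelimiter(input[end]))
                    continue;

                kinds.Add(rule.Kind);
                i = end;
                matched = true;
                break;
            }

            if (!matched)
            {
                // Fallback: consume a run of letters and digits as an identifier.
                var start = i;
                while (i < input.Length && char.IsLetterOrDigit(input[i])) i++;
                if (i == start) throw new FormatException($"Unexpected character at position {i}.");
                kinds.Add("Identifier");
            }
        }
        return kinds;
    }
}
```

With the rules `("is", "Is", true)` and `("null", "Null", true)`, the input "is null" yields the kinds Is and Null, while "isnull" yields a single Identifier, because the "is" match is rejected when the next character is a letter.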

The downside is that tokenizer run time increases linearly with the number of matches attempted. This probably isn’t noticeable for small grammars like the one above, but larger grammars can do much better with a hand-written replacement. Down the line, we might be able to extend the builder with some optimizations to claw this performance back, or add a table-based alternative.
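The linear cost comes from trying recognizers one after another at each input position. The sketch below illustrates that with a hypothetical `Recognizer` delegate and a `FirstMatch` helper that counts attempts. It is an illustration of the general first-match-wins approach, not the builder's actual code.

```csharp
using System;
using System.Collections.Generic;

public static class LinearScanSketch
{
    // Hypothetical recognizer shape: returns the length matched at `pos`, or 0 for no match.
    public delegate int Recognizer(string input, int pos);

    // Builds a recognizer for a fixed, non-empty literal.
    public static Recognizer Literal(string text) =>
        (input, pos) =>
            string.CompareOrdinal(input, pos, text, 0, text.Length) == 0 ? text.Length : 0;

    // Tries recognizers in registration order and returns the index of the first
    // one that matches (or -1), reporting how many were attempted along the way.
    public static int FirstMatch(
        IReadOnlyList<Recognizer> recognizers, string input, int pos, out int attempts)
    {
        attempts = 0;
        for (var r = 0; r < recognizers.Count; r++)
        {
            attempts++;
            if (recognizers[r](input, pos) > 0) return r;
        }
        return -1;
    }
}
```

A token that only the last of N registered recognizers accepts costs N attempts at that position, which is why ordering common matches first helps and why a table-based or hand-written tokenizer can do better on large grammars.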

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Reactions: 4
  • Comments: 18 (5 by maintainers)

Top GitHub Comments

Platzer commented, Mar 7, 2018 (1 reaction)

@nblumhardt TokenizerBuilder is an amazing feature 👍 Yesterday I refactored the tokenizer of a custom DSL from 200 lines of code to 30 easily readable lines of TokenizerBuilder code in less than 40 minutes, and just 2 of ~200 tests are failing. It is really easy to get started!

nblumhardt commented, Mar 7, 2018 (1 reaction)

@Platzer thanks for the link - will try to check it out 😃

@SuperJMN removing whitespace definitely does make constructing the parser a bit simpler 👍
