Add possibility to jump in a state and push new state on stack

In the current implementation, a rule in a stateful lexer can use only one of pop: true, push: "someState", or next: "someOtherState".

Imagine you are in "currentState" and a rule could specify both next: "continueHereState" and push: "parseSomething". The lexer would enter "parseSomething", and the next pop would return to "continueHereState" instead of going back to "currentState".

I found this useful for parsing, for example, function calls in JavaScript with recursively nested arrays and objects. Something like:

identifier(["some", "array", {}, 123], {"object": {"values": ["a", "b"]}, "whatever": false})

I’ve just tweaked these lines https://github.com/no-context/moo/blob/24b23ca961232df15f870f9c8db1c933f2a31e21/moo.js#L484-L486 to this:

    if (group.pop) {
      this.popState()
    } else if (group.push && group.next) {
      this.setState(group.next)
      this.pushState(group.push)
    } else if (group.push) {
      this.pushState(group.push)
    } else if (group.next) {
      this.setState(group.next)
    }

Is this something you might want a PR for? Would it make sense to allow next inside a pop as well (resulting in setState directly after the pop)?
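To make the intended stack behaviour concrete, here is a minimal, self-contained sketch. The StateStack class is a hypothetical stand-in, not Moo's actual code: it assumes pushState saves the current state before switching and popState restores the most recently saved one, and it applies the branch order from the tweak above.

```javascript
// Hypothetical stand-in for the lexer's state handling (not Moo's real class).
class StateStack {
  constructor(initial) {
    this.state = initial
    this.stack = []
  }
  setState(state) {
    this.state = state
  }
  pushState(state) {
    // Assumption: remember where we are, then switch.
    this.stack.push(this.state)
    this.setState(state)
  }
  popState() {
    this.setState(this.stack.pop())
  }
  // Same branch order as the proposed change, for a matched rule `group`.
  apply(group) {
    if (group.pop) {
      this.popState()
    } else if (group.push && group.next) {
      this.setState(group.next)   // the state to come back to later
      this.pushState(group.push)  // the state to enter right now
    } else if (group.push) {
      this.pushState(group.push)
    } else if (group.next) {
      this.setState(group.next)
    }
  }
}

const s = new StateStack('currentState')
s.apply({next: 'continueHereState', push: 'parseSomething'})
const during = s.state  // 'parseSomething'
s.apply({pop: true})
const after = s.state   // 'continueHereState'
console.log(during, after)
```

Because setState runs before pushState, the state saved on the stack is "continueHereState" rather than "currentState", which is the whole point of combining the two options.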

nathan commented, Sep 23, 2018

This is a fairly common problem people have when writing lexers and parsers. You generally want your lexer to be as dumb and permissive as possible, i.e., it should know nothing about syntax except what the tokens are and the absolute minimum necessary to distinguish among them (in your example, the ability to distinguish between regular text and JavaScript code). I’d recommend writing your lexer like this:

const lexer = moo.states({
  main: {
    label: {match: /#/, next: 'label'},
    text: moo.fallback,
  },
  label: {
    call: {match: /\w+\(/, value: s => s.slice(0, -1), next: 'call'},
    name: {match: /\w+/, next: 'main'},
  },
  call: {
    comma: ',',
    colon: ':',
    lbrace: '{',
    rbrace: '}',
    lbracket: '[',
    rbracket: ']',
    rparen: {match: ')', next: 'main'},
    true: 'true',
    false: 'false',
    null: 'null',
    ws: {match: /\s+/, lineBreaks: true},
    // integer part: 0, or a nonzero digit followed by any further digits
    number: /-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][-+]?\d+)?/,
    string: /"(?:\\["bfnrt/\\]|\\u[a-fA-F0-9]{4}|[^"\\])*"/,
  },
})

lexer.reset(`what a #neat #thing() to #look({"hi":null,"blubb":{}}, [1, [null, []], 1], "hello", 123, "blubb") at`)

That gives you a token stream like this:

text what a 
label #
name neat
text  
label #
call thing
rparen )
text  to 
label #
call look
lbrace {
string "hi"
colon :
null null
comma ,
string "blubb"
colon :
lbrace {
rbrace }
rbrace }
comma ,
ws  
lbracket [
number 1
comma ,
ws  
lbracket [
null null
comma ,
ws  
lbracket [
rbracket ]
rbracket ]
comma ,
ws  
number 1
rbracket ]
comma ,
ws  
string "hello"
comma ,
ws  
number 123
comma ,
ws  
string "blubb"
rparen )
text  at

The reason your lexer should be permissive and un-clever is twofold:

1. You don’t end up duplicating your work. Your parser is going to encode the full syntax of the language anyway (e.g., that every { must be matched by a } and contain key-value pairs), so there’s no reason your lexer needs to know that too, and you can save yourself some time and maintenance effort by not writing the language syntax out twice. Also, parsers are designed to encode structural information (whereas lexers are designed to encode character-based information), so you’ll find it much easier to describe the structural features of the language in a parser (e.g., lbrace (string colon value (comma string colon value)*)? rbrace instead of every state that starts with object in your example).
2. You can give better error messages. When you’re lexing, the only information you have is: in the current state, the remainder of the source file doesn’t start with a valid token. That’s not a lot to work with. When you make your lexer overly permissive, your parser can give more informed feedback by talking about language constructs at the token and parse tree levels instead of at the character level (e.g., “expected , or ) after argument, but got number” instead of “unexpected 4”).
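As a rough illustration of both points, the sketch below is a small hand-written recursive-descent parser for one call. It is an assumption-laden example rather than anything Moo provides: parseCall and its helpers are hypothetical names, and it expects an array of {type, value} tokens like those in the stream above, with ws tokens already filtered out by the caller.

```javascript
// Hypothetical parser over tokens shaped like {type, value}; not part of Moo.
function parseCall(tokens) {
  let pos = 0
  const peek = () => tokens[pos]
  const fail = (expected) => {
    const tok = peek()
    throw new Error(`expected ${expected}, but got ${tok ? tok.type : 'end of input'}`)
  }
  // Consume one token of the given type or report what was expected instead.
  const eat = (type, expected = type) => {
    const tok = peek()
    if (!tok || tok.type !== type) fail(expected)
    pos++
    return tok
  }

  function value() {
    const tok = peek()
    if (!tok) fail('a value')
    switch (tok.type) {
      case 'string': pos++; return JSON.parse(tok.value)
      case 'number': pos++; return Number(tok.value)
      case 'true': pos++; return true
      case 'false': pos++; return false
      case 'null': pos++; return null
      case 'lbracket': return array()
      case 'lbrace': return object()
      default: fail('a value')
    }
  }

  // lbracket (value (comma value)*)? rbracket
  function array() {
    eat('lbracket')
    const items = []
    if (peek() && peek().type !== 'rbracket') {
      for (;;) {
        items.push(value())
        if (!peek() || peek().type !== 'comma') break
        pos++
      }
    }
    eat('rbracket', ', or ] after array item')
    return items
  }

  // lbrace (string colon value (comma string colon value)*)? rbrace
  function object() {
    eat('lbrace')
    const result = {}
    if (peek() && peek().type !== 'rbrace') {
      for (;;) {
        const key = JSON.parse(eat('string', 'a string key').value)
        eat('colon', ': after key')
        result[key] = value()
        if (!peek() || peek().type !== 'comma') break
        pos++
      }
    }
    eat('rbrace', ', or } after value')
    return result
  }

  const name = eat('call').value
  const args = []
  if (peek() && peek().type !== 'rparen') {
    for (;;) {
      args.push(value())
      if (!peek() || peek().type !== 'comma') break
      pos++
    }
  }
  eat('rparen', ', or ) after argument')
  return {name, args}
}

const t = (type, value) => ({type, value})
const ok = parseCall([
  t('call', 'look'), t('lbracket', '['), t('number', '1'), t('rbracket', ']'),
  t('comma', ','), t('string', '"hello"'), t('rparen', ')'),
])

let message = ''
try {
  // Two numbers in a row with no comma between them.
  parseCall([t('call', 'look'), t('number', '1'), t('number', '4'), t('rparen', ')')])
} catch (e) {
  message = e.message
}
console.log(ok, message)
```

The nesting rules live only in array and object, so the lexer needs no extra states for them, and the failure case is reported in terms of tokens and arguments rather than raw characters.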

tjvr commented, Sep 29, 2018

That’s right: I don’t think we want to add a feature like this at this time. Moo is intended to be used with a parser of some kind, and we’re not planning to add parser-like features to it. (If someone was using Moo with a parser, and could demonstrate a use case that required this, then we’d think about it again.)

Good luck with your project! 😊