Feedback: My Experience with Nearley
As background: I have good experience with regex, I've hand-written simple parsers and lexers, and I've worked with ASTs and SSA representations in statically compiled languages.
For a recent project I needed to parse a DSL of my own design, and going character-by-character over a string probably wouldn't work very well in JavaScript. Besides, parser generators are cool and reusable.
It wasn’t at all hard to get into writing my first grammar, and nearley playground especially is an awesome tool. On my way to having a working grammar I ran into a few gotchas I’d like to share.
1. I found the way `|` works with postprocessors surprising
When I write `animal -> "cow" | "sheep"`, the nonterminal `animal` can be `"cow"` or `"sheep"`. Makes sense so far. With the grammar now parsing my corpus, I would write a postprocessor:

animal -> "cow" | "sheep" {% ([value]) => value.join("") %}

If the input is `"sheep"` I get `"sheep"`, as expected, but if the input is `"cow"` I instead get `["c", "o", "w"]`, because the `{% %}` postprocessor block associates with the last `|`-separated alternative rather than with the whole line, as I initially expected.
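What I eventually worked out, sketched here for reference: the postprocessor has to be attached to each alternative (or the alternatives split into separate rules) for it to apply no matter which one matched. A minimal sketch, not taken from my actual grammar:

```
animal -> "cow"   {% ([value]) => value.join("") %}
        | "sheep" {% ([value]) => value.join("") %}
```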
OK, I thought, clearly I'm not getting how the input array maps to the separate alternatives of this nonterminal. I'll `console.log` the arguments.
animal -> "cow" | "sheep" {% (...args) => console.log(args) %}
Now, if I input `"cow"` I get nothing at all in the console, but if I input `"sheep"` I get `["s", "h", "e", "e", "p"]`, as expected. That made tracking down this issue very hard, as I was really running blind.
To keep things simple, I then rewrote all my rules, wherever I could, as single alternatives:
expression -> thing
expression -> thinga "or" thingb
# etc
2. The association of parsed values to postprocessor arguments is unclear
When I wrote my first rule that could or could not contain some value, I was unsure how the values within the optional groups would associate with the arguments I got. My initial intuition was that this would work like regex, where capture-group indexing is not hierarchical:
line -> statement ( white_space:+ optional_annotation):? ( white_space:+ comment):? {%
  ([statement, white_space_1, optional_annotation, white_space_2, comment]) => statement
%}
When I realised there weren't enough arguments, I tried deeper list destructuring to access the values I imagined were inside the hierarchical construction:
line -> stmt ( white_space:+ optional_annotation):? ( white_space:+ comment):? {%
  ([stmt, [/*_*/, optional_annotation], [/*_*/, comment]]) => ({stmt})
%}
This then threw 'not an object' type errors, because the second argument wasn't an array when the group didn't match anything, so I added defaults:
line -> stmt ( white_space:+ optional_annotation):? ( white_space:+ comment):? {%
  ([stmt, [/*_*/, optional_annotation] = [0,0], [/*_*/, comment] = [0,0]]) => ({stmt})
%}
This was meant to ensure that, in case argument 2 was `undefined`, `white_space_1` and `optional_annotation` would fall back to falsey values. That didn't work either, because when an optional group doesn't match anything it actually returns `null` rather than `undefined`.
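The difference is easy to demonstrate in plain JavaScript, independent of nearley (the function name here is made up for illustration):

```javascript
// Destructuring defaults only replace undefined, never null,
// which is why the "= [0,0]" fallbacks fail for an unmatched group.
const pickSecond = ([, second = "fallback"] = []) => second;

console.log(pickSecond(["a", undefined])); // "fallback": default applies
console.log(pickSecond(["a", null]));      // null: default does NOT apply
```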
I had to fall back to using a group of `if` statements to check each optional block:
line -> stmt ( white_space:+ optional_annotation):? ( white_space:+ comment):? {%
  ([stmt, annotation, comment]) => {
    let o = {stmt};
    if (annotation) o.annotation = annotation[1];
    if (comment) o.comment = comment[1];
    return o;
  }
%}
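The same logic can be exercised outside nearley with plain data. A sketch assuming, as above, that an unmatched optional group arrives as `null` and a matched one as a `[whitespace, value]` pair (simplified shapes, not nearley's exact token objects):

```javascript
// Mirrors the if-statement postprocessor on plain arrays: null groups
// are skipped, matched groups contribute their second element.
function postprocessLine([stmt, annotation, comment]) {
  const o = { stmt };
  if (annotation) o.annotation = annotation[1];
  if (comment) o.comment = comment[1];
  return o;
}

console.log(postprocessLine(["x = 1", null, [" ", "# note"]]));
// { stmt: "x = 1", comment: "# note" }
```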
While I don't necessarily think this is a bad way of doing things, there's no canonical reference for the syntax, and the documentation that does exist doesn't call out how this association works.
Moreover, the information on syntax that exists is spread through two different documents: ‘Writing a Parser’ with notable section ‘More Syntax: tips and tricks’, and ‘How to Grammar Good’.
These documents cover similar topics, but some syntax is introduced in only one of them, meaning you really do have to read 'How to Grammar Good' as well to understand the language syntax as fully as you can.
3. Tokenisation is often needed, but scarcely documented
I was happy to go from zero to a workable grammar in a few hours. However, I quickly ran into slowness issues. This seems to be very common with complex grammars, and the answer is always to use an additional, plug-in tokeniser:
https://github.com/kach/nearley/issues/303#issuecomment-333339229
https://github.com/kach/nearley/issues/312#issuecomment-335267061
https://github.com/kach/nearley/issues/238#issuecomment-302972228
https://github.com/kach/nearley/issues/111#issuecomment-234822737
etc.
While it's not unreasonable or unexpected to need a tokenizer alongside a parser, none of the example grammars in the documentation use one, with the exception of the single, extremely simple grammar in the Tokenizers section. The only part of the parser-writing documentation that refers to that page is the very last paragraph of 'Using a Parser'.
What this meant for me personally is that I’d built a fairly complex grammar over a few days, which I then had to totally re-write to support the tokenization framework.
Conversion to the tokenization syntax is non-trivial, and this is exacerbated by the fact that it changes the meaning of nearley's syntax. The impression I got from the tokenization article was that you could drop tokenization in essentially wherever you wanted and nearley would intelligently decide when tokenization was necessary. This is probably obvious to nearley's developers – I'm not quite smart enough to be writing parser-generators – but it was unexpected to me that `"hello"` in my grammar would no longer match the input `"hello"` unless I explicitly wrote a token for it.
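For illustration, a minimal sketch of what the converted setup looks like, using the moo lexer that the nearley docs point to (the token names here are my own hypothetical choices):

```
@{%
const moo = require("moo");
const lexer = moo.compile({
  ws:   /[ \t]+/,
  word: /[a-zA-Z]+/,
});
%}

@lexer lexer

# Rules now match token types via %name, not raw characters.
greeting -> %word {% ([tok]) => tok.value %}
```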
Thanks
I don't mean this to be an existential criticism of the nearley project. Nearley is super useful and has been a fun ride for me. I hope my feedback helps 😃
Issue Analytics: Created 5 years ago · Reactions: 6 · Comments: 5
Top GitHub Comments
I also found these problems. It's easy to write the grammar, but getting a usable AST out of the postprocessor is very hard. It seems like there are many more [] than are really necessary. Often I want to have something like a separated list, say
list -> value ("," value):*
… basically bashing my head against log statements and tests to try to figure out the postprocessor.

@jameslaydigital I think you've misunderstood the purpose of this issue. I know how to use nearley; I am just highlighting ways that it is confusing.
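For what it's worth, a sketch of a postprocessor for that separated-list shape, assuming each repetition arrives as a `[separator, value]` pair (a simplification of nearley's real output, which carries token objects):

```javascript
// Flattens the parse of `value ("," value):*` into a plain array,
// assuming each repeated item is a [separator, value] pair.
const flattenList = ([first, rest]) => [first, ...rest.map(([, v]) => v)];

console.log(flattenList(["a", [[",", "b"], [",", "c"]]])); // ["a","b","c"]
```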