Feedback: My Experience with Nearley
As background: I have good experience with regex, I've hand-written simple parsers and lexers, and I've worked with ASTs and SSA representations in statically compiled languages.
For a recent project I needed to parse a DSL of my own design, and going character-by-character over a string probably wouldn't work very well in JavaScript. Besides, parser generators are cool and reusable.
It wasn’t at all hard to get into writing my first grammar, and nearley playground especially is an awesome tool. On my way to having a working grammar I ran into a few gotchas I’d like to share.
1. I found the way `|` works with postprocessors surprising
When I write `animal -> "cow" | "sheep"`, the nonterminal `animal` can be `"cow"` or `"sheep"`. Makes sense so far. With the grammar now parsing my corpus, I would write a postprocessor:

animal -> "cow" | "sheep" {% ([value]) => value.join("") %}

If the input is `"sheep"` I get `"sheep"`, as expected, but if the input is `"cow"` I instead get `["c", "o", "w"]`, because the `{% %}` postprocessor block associates with the last `|`-separated alternative rather than with the whole line, as I initially expected.
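What I eventually worked out, sketched here for reference: the postprocessor has to be attached to each alternative (or the alternatives split into separate rules) for it to apply no matter which one matched. A minimal sketch, not taken from my actual grammar:

```
animal -> "cow"   {% ([value]) => value.join("") %}
        | "sheep" {% ([value]) => value.join("") %}
```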
OK, I thought, clearly I'm not getting how the input array maps to the separate alternatives of this nonterminal. I'll `console.log` the arguments.
animal -> "cow" | "sheep" {% (...args) => console.log(args) %}
Now, if I input `"cow"` I get nothing at all in the console, but if I input `"sheep"` I get `["s", "h", "e", "e", "p"]`, as expected. That made tracking down this issue very hard, as I was really running blind.
To keep things simple, I then rewrote all my rules, wherever I could, as single alternatives:
expression -> thing
expression -> thinga "or" thingb
# etc
2. The association of parsed values to postprocessor arguments is unclear
When I wrote my first rule that could or could not contain some value, I was unsure how the values within the optional groups would associate with the arguments I got. My initial intuition was that this would work like regex, where capture-group indexing is not hierarchical:
line -> statement ( white_space:+ optional_annotation):? ( white_space:+ comment):? {%
  ([statement, white_space_1, optional_annotation, white_space_2, comment]) => statement
%}
When I realised there weren't enough arguments, I tried deeper list destructuring to access the values I imagined were inside the hierarchical construction:
line -> stmt ( white_space:+ optional_annotation):? ( white_space:+ comment):? {%
  ([stmt, [/*_*/, optional_annotation], [/*_*/, comment]]) => ({stmt})
%}
This then threw 'not an object' type errors, because the second argument wasn't an array when the group didn't match anything, so I added defaults:
line -> stmt ( white_space:+ optional_annotation):? ( white_space:+ comment):? {%
  ([stmt, [/*_*/, optional_annotation] = [0,0], [/*_*/, comment] = [0,0]]) => ({stmt})
%}
This was meant to ensure that, in case argument 2 was `undefined`, `white_space_1` and `optional_annotation` would fall back to falsey values. That didn't work either, because when an optional group doesn't match anything it actually returns `null` rather than `undefined`.
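The difference is easy to demonstrate in plain JavaScript, independent of nearley (the function name here is made up for illustration):

```javascript
// Destructuring defaults only replace undefined, never null,
// which is why the "= [0,0]" fallbacks fail for an unmatched group.
const pickSecond = ([, second = "fallback"] = []) => second;

console.log(pickSecond(["a", undefined])); // "fallback": default applies
console.log(pickSecond(["a", null]));      // null: default does NOT apply
```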
I had to fall back to using a group of `if` statements to check each optional block:
line -> stmt ( white_space:+ optional_annotation):? ( white_space:+ comment):? {%
  ([stmt, annotation, comment]) => {
    let o = {stmt};
    if (annotation) o.annotation = annotation[1];
    if (comment) o.comment = comment[1];
    return o;
  }
%}
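The same logic can be exercised outside nearley with plain data. A sketch assuming, as above, that an unmatched optional group arrives as `null` and a matched one as a `[whitespace, value]` pair (simplified shapes, not nearley's exact token objects):

```javascript
// Mirrors the if-statement postprocessor on plain arrays: null groups
// are skipped, matched groups contribute their second element.
function postprocessLine([stmt, annotation, comment]) {
  const o = { stmt };
  if (annotation) o.annotation = annotation[1];
  if (comment) o.comment = comment[1];
  return o;
}

console.log(postprocessLine(["x = 1", null, [" ", "# note"]]));
// { stmt: "x = 1", comment: "# note" }
```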
While I don't necessarily think this is a bad way of doing things, there's no canonical reference for the syntax, and the documentation that does exist doesn't call out how this association works.
Moreover, the information on syntax that exists is spread through two different documents: ‘Writing a Parser’ with notable section ‘More Syntax: tips and tricks’, and ‘How to Grammar Good’.
These documents cover similar topics, but some syntax is introduced in only one of them, meaning you really do have to read 'How to Grammar Good' as well to understand the language syntax as fully as you can.
3. Tokenisation is often needed, but scarcely documented
I was happy to go from zero to a workable grammar in a few hours. However, I quickly ran into slowness issues. This seems to be very common with complex grammars, and the answer is always to use an additional, plug-in tokeniser:
https://github.com/kach/nearley/issues/303#issuecomment-333339229
https://github.com/kach/nearley/issues/312#issuecomment-335267061
https://github.com/kach/nearley/issues/238#issuecomment-302972228
https://github.com/kach/nearley/issues/111#issuecomment-234822737
etc.
While it's not unreasonable or unexpected to need a tokenizer alongside a parser, none of the example grammars in the documentation use one, with the exception of the single, extremely simple grammar in the Tokenizers section. The only part of the parser-writing documentation that refers to that page is the very last paragraph of 'Using a Parser'.
What this meant for me personally is that I’d built a fairly complex grammar over a few days, which I then had to totally re-write to support the tokenization framework.
Conversion to the tokenization syntax is non-trivial, and this is exacerbated by the fact that it changes the meaning of nearley's syntax. The impression I got from the tokenization article was that you could drop tokenization in essentially wherever you wanted and nearley would intelligently decide when tokenization was necessary. This is probably obvious to nearley's developers – I'm not quite smart enough to be writing parser-generators – but it was unexpected to me that `"hello"` in my grammar would no longer match the input `"hello"` unless I explicitly wrote a token for it.
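For illustration, a minimal sketch of what the converted setup looks like, using the moo lexer that the nearley docs point to (the token names here are my own hypothetical choices):

```
@{%
const moo = require("moo");
const lexer = moo.compile({
  ws:   /[ \t]+/,
  word: /[a-zA-Z]+/,
});
%}

@lexer lexer

# Rules now match token types via %name, not raw characters.
greeting -> %word {% ([tok]) => tok.value %}
```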
Thanks
I don't mean this to be an existential criticism of the nearley project. Nearley is super useful and has been a fun ride for me. I hope my feedback helps 😃
Issue Analytics: Created 5 years ago · Reactions: 6 · Comments: 5
Top GitHub Comments
I also found these problems. It's easy to write the grammar, but getting a usable AST out of the postprocessor is very hard. It seems like there are many more [] than are really necessary. Often I want to have something like a separated list, say
list -> value ("," value):*
… basically bashing my head against log statements and tests to try to figure out the postprocessor.

@jameslaydigital I think you've misunderstood the purpose of this issue. I know how to use nearley; I am just highlighting ways that it is confusing.
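For what it's worth, a sketch of a postprocessor for that separated-list shape, assuming each repetition arrives as a `[separator, value]` pair (a simplification of nearley's real output, which carries token objects):

```javascript
// Flattens the parse of `value ("," value):*` into a plain array,
// assuming each repeated item is a [separator, value] pair.
const flattenList = ([first, rest]) => [first, ...rest.map(([, v]) => v)];

console.log(flattenList(["a", [[",", "b"], [",", "c"]]])); // ["a","b","c"]
```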