Code review/feedback: parsing dictionary formats with Nearley
Hi all,
I was hoping to be pointed in the right direction of using Nearley for my particular use case (in order to grammar even more goodly).
Use case
I’m using Nearley to validate and parse a “sort of” domain-specific format called Toolbox format or “backslash-coded lexicons”. I say “sort of”, because there is a fair bit of project-to-project variation. I’m quite sure I’ll have to write project-specific grammars, so I’m hoping to make the process as streamlined as possible.
What each dictionary project has in common, though, is that the dictionaries are written as lines of key-value pairs (the keys vary from project to project: \word, \headword, \lx for lexeme, etc.). A simple example of a two-word dictionary in such a format might look like:
\word rouge
\partOfSpeech Adjective
\definition red
\word bonjour
\partOfSpeech Exclamation
\definition hello
\exampleFrench Il a dit 'bonjour'
\translationEnglish He said 'hello'
A [very hacky] Nearley grammar for this might look like:
dictionary -> record:+ # {% prettify_records %}
record -> word partOfSpeech definition example:*
word -> _NL:* "\\word" _ _ABNL
partOfSpeech -> _NL "\\partOfSpeech" _ validPartOfSpeech
validPartOfSpeech -> "Adjective" | "Exclamation"
definition -> _NL "\\definition" _ _ABNL
example -> exampleFrench translationEnglish
exampleFrench -> _NL "\\exampleFrench" _ _ABNL
translationEnglish -> _NL "\\translationEnglish" _ _ABNL
_NL -> "\n" # New line
_ABNL -> [^\n]:+ # All but new line
_ -> " " # Space
Suppose that {% prettify_records %}, once uncommented, then returns a JSON object like:
{
  "dictionary": [
    {
      "word": "rouge",
      "partOfSpeech": "Adjective",
      "definition": "red",
      "examples": []
    },
    {
      "word": "bonjour",
      "partOfSpeech": "Exclamation",
      "definition": "hello",
      "examples": [
        {
          "exampleFrench": "Il a dit 'bonjour'",
          "englishTranslation": "He said 'hello'"
        }
      ]
    }
  ]
}
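For illustration, a hypothetical prettify_records postprocessor could flatten Nearley's nested parse arrays into that shape. The record layout below is an assumption (it depends on how each rule's own postprocessor shapes its result); this is a sketch, not an actual implementation:

```javascript
// Hypothetical postprocessor: turns parsed records into the JSON
// structure shown above. Each record is assumed to arrive as
// [word, partOfSpeech, definition, examples] from the rule
//   record -> word partOfSpeech definition example:*
function prettify_records([records]) {
  return {
    dictionary: records.map(([word, partOfSpeech, definition, examples]) => ({
      word,
      partOfSpeech,
      definition,
      examples,
    })),
  };
}

// Example call with already-flattened leaf values:
const result = prettify_records([[
  ["rouge", "Adjective", "red", []],
  ["bonjour", "Exclamation", "hello",
   [{ exampleFrench: "Il a dit 'bonjour'", englishTranslation: "He said 'hello'" }]],
]]);
```

In a real grammar the inner rules would need their own postprocessors to reduce each line to a plain string before this function sees it.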
With the data structured as JSON, we can run queries over it, or generate derived outputs such as a searchable web version of the dictionary or a printable PDF via LaTeX (both through Hugo).
Question: where to from my basic/hacky grammar?
I think I understand that I probably should be using a lexer of some form to first process the key-value pairs. While I’ve looked at Moo and its documentation, I’m not quite sure how to think about lexing when the type of the token is meant to be derived from the data itself. As opposed to the example in the Nearley docs (https://nearley.js.org/docs/tokenizers):
@{%
const moo = require("moo");

const lexer = moo.compile({
  ws: /[ \t]+/,
  number: /[0-9]+/,
  times: /\*|x/
});
%}
# Pass your lexer object using the @lexer option:
@lexer lexer
# Use %token to match any token of that type instead of "token":
multiplication -> %number %ws %times %ws %number {% ([first, , , , second]) => first * second %}
where the lexer classifies the chunks into the types ws, number, or times, I’m guessing that for my use case I need to be able to derive these types from the backslash codes themselves (e.g. '\partOfSpeech Adjective' => { "type": "partOfSpeech", "value": "Adjective" }), which I assume/hope will help simplify the grammar(s) I write to:
@lexer someCustomLexer
dictionary -> record:+ # {% prettify_records %}
record -> %word %partOfSpeech %definition %example:*
# No terminals down here!
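The type-from-data idea can be sketched by hand in plain JavaScript (this is not Moo; the key names are just whatever appears after the backslash). Note that an object passed to Nearley’s @lexer must implement Moo’s lexer interface (next, save, reset, formatError, has), so in practice it may be easier to configure Moo itself; this sketch only shows the classification step:

```javascript
// Hand-rolled sketch (not Moo) of a lexer pass that derives each
// token's type from the backslash code itself, so that
//   "\partOfSpeech Adjective"
// becomes { type: "partOfSpeech", value: "Adjective" }.
function tokenize(input) {
  const tokens = [];
  for (const line of input.split("\n")) {
    // \key, at least one space/tab, then the rest of the line as the value
    const match = /^\\(\w+)[ \t]+(.*)$/.exec(line);
    if (match) {
      tokens.push({ type: match[1], value: match[2] });
    }
  }
  return tokens;
}

const tokens = tokenize("\\word rouge\n\\partOfSpeech Adjective\n\\definition red");
// tokens[0] is { type: "word", value: "rouge" }
```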
Any hints/help towards a more sound way of parsing such formats would be highly appreciated.
Thanks!
Issue Analytics
- Created: 6 years ago
- Comments: 10 (1 by maintainers)
Top GitHub Comments
I do, actually! I should have been more explicit about this, given that it’s one of the key aims (to enforce that the project linguists enter the data in a certain order).
Ah, that part makes sense now: it means I can mix literals in "..." and tokenised objects from the lexer (%...) in my grammar. I had thought I was obliged to use items from (and only from) the tokeniser output. Much clearer now!

This worked out really great, thanks! A bonus for us is also that the .ne file itself can be verbose enough to act as a data entry guide for the linguists, i.e. “a record is made up of the lexeme, and one or more parts of speech, each of which must be a valid part of speech: noun, etc.”.

If it helps to demo, please feel free to use my current grammar/test data, where lexeme and partOfSpeech make use of raw strings ("\\lx ", "\\ps "), tokens from Moo (%newline), and also other Nearley (non-)terminals (validPartOfSpeech): grammar.ne, dictionary.txt.

Just out of curiosity/to learn more about Moo, what would be the disadvantage of using the keyword feature (especially if %unknownKey, rightfully, fails with a syntax error and tells the user about it)?

Ah, gotcha! Thanks @Hardmath123! Everything’s working as expected now.
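For future readers, the trade-off behind that keyword question can be sketched in plain JavaScript (not Moo; the key set below is an assumption): with a fixed set of known keys, a misspelled backslash code fails at lex time with a pointed error, whereas a catch-all key rule defers the failure to the parser, where the message is usually less direct:

```javascript
// Plain-JS sketch of "keyword-style" lexing: only keys in KNOWN_KEYS
// are accepted, so a typo like \partOfSpeach is reported immediately
// at lex time instead of surfacing later as a parse error.
const KNOWN_KEYS = new Set(["word", "partOfSpeech", "definition"]);

function lexLine(line) {
  const match = /^\\(\w+)[ \t]+(.*)$/.exec(line);
  if (!match) throw new Error(`Not a key-value line: ${line}`);
  const [, key, value] = match;
  if (!KNOWN_KEYS.has(key)) {
    throw new Error(`Unknown key "\\${key}" in: ${line}`);
  }
  return { type: key, value };
}

const tok = lexLine("\\word rouge"); // { type: "word", value: "rouge" }

let error = null;
try {
  lexLine("\\partOfSpeach Adjective"); // misspelled key
} catch (e) {
  error = e.message;
}
```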