Code review/feedback: parsing dictionary formats with Nearley
Hi all,
I was hoping to be pointed in the right direction of using Nearley for my particular use case (in order to grammar even more goodly).
Use case
I’m using Nearley to validate and parse a “sort of” domain-specific format called Toolbox format or “backslash-coded lexicons”. I say “sort of”, because there is a fair bit of project-to-project variation. I’m quite sure I’ll have to write project-specific grammars, so I’m hoping to make the process as streamlined as possible.
What each dictionary project has in common, though, is that the dictionaries are written as lines of key-value pairs (the keys vary from project to project: \word, \headword, \lx for lexeme, etc.). A simple example of a two-word dictionary in such a format might look like:
\word rouge
\partOfSpeech Adjective
\definition red
\word bonjour
\partOfSpeech Exclamation
\definition hello
\exampleFrench Il a dit 'bonjour'
\translationEnglish He said 'hello'
A [very hacky] Nearley grammar for this might look like:
dictionary -> record:+ # {% prettify_records %}
record -> word partOfSpeech definition example:*
word -> _NL:* "\\word" _ _ABNL
partOfSpeech -> _NL "\\partOfSpeech" _ validPartOfSpeech
validPartOfSpeech -> "Adjective" | "Exclamation"
definition -> _NL "\\definition" _ _ABNL
example -> exampleFrench translationEnglish
exampleFrench -> _NL "\\exampleFrench" _ _ABNL
translationEnglish -> _NL "\\translationEnglish" _ _ABNL
_NL -> "\n" # New line
_ABNL -> [^\n]:+ # All but new line
_ -> " " # Space
Suppose that {% prettify_records %}, once uncommented, then returns a JSON object like:
{
  "dictionary": [
    {
      "word": "rouge",
      "partOfSpeech": "Adjective",
      "definition": "red",
      "examples": []
    },
    {
      "word": "bonjour",
      "partOfSpeech": "Exclamation",
      "definition": "hello",
      "examples": [
        {
          "exampleFrench": "Il a dit 'bonjour'",
          "englishTranslation": "He said 'hello'"
        }
      ]
    }
  ]
}
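For illustration, a hypothetical prettify_records postprocessor could flatten Nearley's nested parse arrays into that shape. The record layout below is an assumption (it depends on how each rule's own postprocessor shapes its result); this is a sketch, not an actual implementation:

```javascript
// Hypothetical postprocessor: turns parsed records into the JSON
// structure shown above. Each record is assumed to arrive as
// [word, partOfSpeech, definition, examples] from the rule
//   record -> word partOfSpeech definition example:*
function prettify_records([records]) {
  return {
    dictionary: records.map(([word, partOfSpeech, definition, examples]) => ({
      word,
      partOfSpeech,
      definition,
      examples,
    })),
  };
}

// Example call with already-flattened leaf values:
const result = prettify_records([[
  ["rouge", "Adjective", "red", []],
  ["bonjour", "Exclamation", "hello",
   [{ exampleFrench: "Il a dit 'bonjour'", englishTranslation: "He said 'hello'" }]],
]]);
```

In a real grammar the inner rules would need their own postprocessors to reduce each line to a plain string before this function sees it.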
With the data structured as JSON, we can run queries over it, or generate derived outputs such as a searchable web version of the dictionary or a printable PDF via LaTeX (both through Hugo).
Question: where to from my basic/hacky grammar?
I think I understand that I probably should be using a lexer of some form to first process the key-value pairs. While I’ve looked at Moo and its documentation, I’m not quite sure how to think about lexing when the type of the token is meant to be derived from the data itself. As opposed to the example in the Nearley docs (https://nearley.js.org/docs/tokenizers):
@{%
const moo = require("moo");

const lexer = moo.compile({
  ws: /[ \t]+/,
  number: /[0-9]+/,
  times: /\*|x/
});
%}
# Pass your lexer object using the @lexer option:
@lexer lexer
# Use %token to match any token of that type instead of "token":
multiplication -> %number %ws %times %ws %number {% ([first, , , , second]) => first * second %}
where the lexer classifies the chunks into the types ws, number, or times, I’m guessing that for my use case I need to be able to derive these types from the backslash codes themselves (e.g. '\partOfSpeech Adjective' => { "type": "partOfSpeech", "value": "Adjective" }), which I assume/hope will help simplify the grammar(s) I write to:
@lexer someCustomLexer
dictionary -> record:+ # {% prettify_records %}
record -> %word %partOfSpeech %definition %example:*
# No terminals down here!
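The type-from-data idea can be sketched by hand in plain JavaScript (this is not Moo; the key names are just whatever appears after the backslash). Note that an object passed to Nearley’s @lexer must implement Moo’s lexer interface (next, save, reset, formatError, has), so in practice it may be easier to configure Moo itself; this sketch only shows the classification step:

```javascript
// Hand-rolled sketch (not Moo) of a lexer pass that derives each
// token's type from the backslash code itself, so that
//   "\partOfSpeech Adjective"
// becomes { type: "partOfSpeech", value: "Adjective" }.
function tokenize(input) {
  const tokens = [];
  for (const line of input.split("\n")) {
    // \key, at least one space/tab, then the rest of the line as the value
    const match = /^\\(\w+)[ \t]+(.*)$/.exec(line);
    if (match) {
      tokens.push({ type: match[1], value: match[2] });
    }
  }
  return tokens;
}

const tokens = tokenize("\\word rouge\n\\partOfSpeech Adjective\n\\definition red");
// tokens[0] is { type: "word", value: "rouge" }
```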
Any hints/help towards a more sound way of parsing such formats would be highly appreciated.
Thanks!
Issue Analytics
- Created: 6 years ago
- Comments: 10 (1 by maintainers)
Top GitHub Comments
I do, actually! I should have been more explicit about this, given that it’s one of the key aims (to enforce that the project linguists enter the data in a certain order).
Ah, that part makes sense now: it means I can mix literals in "..." and tokenised objects from the lexer (%...) in my grammar. I had thought I was obliged to use items from (and only from) the tokeniser output. Much clearer now!

This worked out really great, thanks! A bonus for us is also that the .ne file itself can be verbose enough to act as a data entry guide for the linguists, i.e. “a record is made up of the lexeme, and one or more parts of speech, each of which must be a valid part of speech: noun, etc.”.

If it helps to demo, please feel free to use my current grammar/test data, where lexeme and partOfSpeech make use of raw strings ("\\lx ", "\\ps "), tokens from Moo (%newline), and also other Nearley (non-)terminals (validPartOfSpeech): grammar.ne, dictionary.txt.

Just out of curiosity/to learn more about Moo, what would be the disadvantage of using the keyword feature (especially if %unknownKey, rightfully, fails with a syntax error and tells the user about it)?

Ah, gotcha! Thanks @Hardmath123! Everything’s working as expected now.
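For future readers, the trade-off behind that keyword question can be sketched in plain JavaScript (not Moo; the key set below is an assumption): with a fixed set of known keys, a misspelled backslash code fails at lex time with a pointed error, whereas a catch-all key rule defers the failure to the parser, where the message is usually less direct:

```javascript
// Plain-JS sketch of "keyword-style" lexing: only keys in KNOWN_KEYS
// are accepted, so a typo like \partOfSpeach is reported immediately
// at lex time instead of surfacing later as a parse error.
const KNOWN_KEYS = new Set(["word", "partOfSpeech", "definition"]);

function lexLine(line) {
  const match = /^\\(\w+)[ \t]+(.*)$/.exec(line);
  if (!match) throw new Error(`Not a key-value line: ${line}`);
  const [, key, value] = match;
  if (!KNOWN_KEYS.has(key)) {
    throw new Error(`Unknown key "\\${key}" in: ${line}`);
  }
  return { type: key, value };
}

const tok = lexLine("\\word rouge"); // { type: "word", value: "rouge" }

let error = null;
try {
  lexLine("\\partOfSpeach Adjective"); // misspelled key
} catch (e) {
  error = e.message;
}
```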