Parser specification recommendations

Hey Alex,

Thanks for putting this up on HN! I threw together a really basic parser last night and ran into a couple things that I think are worth clarifying:

1. The spec should require that every item is a line.

That means each line must end with a trailing \n. This resolves some ambiguity in the spec as it stands, and it implies that every item, including the last item in the file, needs a trailing \n in order to parse. That’s pretty standard for just about any text editor and many other text file formats, so it should be fine.
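
To illustrate the rule (this is just a throwaway sketch in Ruby, not code from any actual implementation):

# Hypothetical sketch: only newline-terminated lines count as items.
def each_item_line(text)
  unless text.empty? || text.end_with?("\n")
    raise ArgumentError, "document must end with a trailing newline"
  end
  # Every complete line ends in "\n", so splitting leaves no dangling fragment.
  text.split("\n", -1)[0...-1].each { |line| yield line }
end

each_item_line("Item A\nItem B\n") { |line| puts line }
# prints "Item A" then "Item B"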

2. Regarding annotations in the following form:

Items:
  - Item Y
        [I belong to Item Y]
  - [I belong to Item Z]
          Item Z

I think the second case, an annotation “belonging to” its child, is problematic. My recommendation is to drop it, because I think it’s difficult for humans to understand, let alone a parser.

If you did not want to drop that, then I would propose it be reinterpreted as follows:

Items:
  - Item Y
        [I belong to Item Y]
  - [I am an item with no value, only this annotation]
          I'm a child of this value-less item.

Again, I think just requiring that every item have a value makes more sense, but if you feel it’s critical to keep the format shown above, then a “valueless” item makes more sense than inverting the relationship of values and annotations, which is confusing to both humans and machines.
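
For what it’s worth, under that reading the value-less item in the example above might serialize roughly like this (a Ruby hash mirroring the JSON shape I use later in this comment; purely illustrative, not part of any spec):

# Hypothetical representation of an item with no value, only an annotation.
valueless_item = {
  "value"       => nil,
  "annotations" => ["I am an item with no value, only this annotation"],
  "children"    => [
    { "value" => "I'm a child of this value-less item." }
  ]
}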

3. Annotation types:

UPDATE: based on the discussion below I ended up reconsidering a bunch of these details and instead landed on there being only one kind of annotation, an object like {key: "optional", value: "required"}. Tasks only make sense as a special type of Item. Thus any readers who made it this far can skip past this part 😃
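
For concreteness, a keyless note and a keyed annotation under that single shape might look like this (hypothetical Ruby, just to illustrate the idea):

# One annotation kind: key is optional, value is required.
note  = { "key" => nil,      "value" => "this is just a note" }
keyed = { "key" => "author", "value" => "Thomas Pynchon" }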

In your guidebook you show examples like this:

I'm a plain old item 
   [and I'm an annotation]

and

The Crying of Lot 49
   [author: Thomas Pynchon]
   [publication year: 1966]
   [publisher: J. B. Lippincott & Co.]

But in your sample area, your examples show annotations stored as a hash, with the “index” as the keys.

This leaves several cases unclear.

  • How should a note with no “index” (ie something before a colon) be stored?
  • What happens if two annotations are given with the same “index”?

As an aside, I think you might want to call those “tags” or “keys” because “index” is a little confusing, but… that’s not critical.

Combining this with the other annotation types, it becomes difficult to reason about how the underlying data should be represented. I would propose the following:

  • There are three kinds of Annotations: note, index, and task.
  • All annotations are enclosed in square brackets [...]
  • A task is a pair of square brackets enclosing exactly one character: [ ], [x], [✓], followed by any amount of text and terminated by a \n
  • An “index” (again, the name is weird lol) is an annotation whose contents are separated by a colon: [index: content]
  • An “index” may occur only once per item. If an index is repeated, the last occurrence will be used.
  • Any other annotation is considered a note.

That would result in the following:

Big Item
  [this is just a note]
  [this is another note]
  [author: me]
  [category: code example]
  [category: margin example]
  [x] Put this on Github
  [ ] Get it adopted?
  Everything above is an annotation on Big Item, but I am a child.
  Of course big item can have multiple children.

And the JSON representation of this would look like this (omitting the raw stuff):

{
  "value": "Big Item",
  "annotations": {
    "notes": ["this is just a note", "this is another note"],
    "indices": {
      "author": "me",
      "category": "margin example"
    },
    "tasks": [
      { "done": true, "value": "Put this on Github" },
      { "done": false, "value": "Get this adopted?" },
    ]
  },
  "children": [
    { "value": "Everything above is an annotation on Big Item, but I am a child." },
    { "value": "Of course big item can have multiple children."}
  ]
}
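
And here’s a rough sketch of how the three annotation kinds above could be told apart for a single annotation line (hypothetical Ruby, names made up, not from any real parser):

# Hypothetical classifier for the proposed note / index / task annotations.
def classify_annotation(line)
  inner = line.strip.delete_prefix("[")

  case
  when inner.start_with?(" ] ", "x] ", "✓] ")
    # Task: brackets enclose exactly one character, followed by free text.
    { kind: :task, done: !inner.start_with?(" ] "), value: inner[3..].strip }
  when inner.include?(":")
    # "Index": key and content separated by the first colon.
    key, content = inner.chomp("]").split(":", 2)
    { kind: :index, key: key.strip, value: content.strip }
  else
    # Anything else is a note.
    { kind: :note, value: inner.chomp("]").strip }
  end
end

classify_annotation("[author: me]")           # => {kind: :index, key: "author", value: "me"}
classify_annotation("[x] Put this on Github") # => {kind: :task, done: true, value: "Put this on Github"}
classify_annotation("[this is just a note]")  # => {kind: :note, value: "this is just a note"}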

I have a few other thoughts but those were the big ones and I know this is a lot to digest, so let me know what you think about all the above. Thanks for sharing your project, this is very cool 😃

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 2
  • Comments: 10 (8 by maintainers)

Top GitHub Comments

2 reactions
vlmutolo commented, May 11, 2020

Additionally, the parser should probably strip leading and trailing whitespace in annotation values, so that [a: b] is parsed as a and b instead of a and “ b” (with a leading space).
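
Something like this, roughly (Ruby sketch only, not from any real parser):

key, value = "a: b".split(":", 2).map(&:strip)
# key   == "a"
# value == "b" (not " b")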

1 reaction
burlesona commented, May 13, 2020

@mtsknn and others on this thread, I got far enough with my own implementation to have a working parser that can read Margin and write JSON. It’s not baked enough for wide distribution yet, but it has tests and covers all the cases we’ve talked about. I went with the key/value structure for annotations that we discussed above.

The code is here: https://github.com/burlesona/margin-rb

The divergences from Alex’s implementation that I’m aware of ended up being as follows:

  1. Every valid item must end in a newline, which means the document itself must end in a newline if there is an item on the last line.
  2. Items can be one of two types, “item” or “task.” To represent this I added a ‘type’ field on the Item’s JSON representation. This is mostly informational and avoids users of the parsed data needing to do additional checks to figure out if an item is a task or not.
  3. Items which are tasks have a boolean done field indicating whether they are done.
  4. The value field on each item is cleaned of any leading or trailing decoration, including whitespace. Thus, the value field of a Task does not include the leading “checkbox” annotation.
  5. Annotations are represented as objects with a required value and an optional key. In Alex’s implementation the key and value of what he calls an “index” (an annotation with a colon in it) are not machine parsed.
  6. If the value of an annotation is strictly numeric, it is parsed into a number, not a string (see the rough sketch of points 5 and 6 after this list).
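
Schematically, points 5 and 6 amount to something like this (illustrative Ruby only; this is not the actual margin-rb code):

# Illustrative only: split an annotation into an optional key and a required value,
# then coerce strictly numeric values into numbers.
def parse_annotation(text)
  if text.include?(":")
    key, value = text.split(":", 2).map(&:strip)
  else
    key, value = nil, text.strip
  end

  value =
    if value.match?(/\A-?\d+\z/)         then Integer(value)
    elsif value.match?(/\A-?\d+\.\d+\z/) then Float(value)
    else value
    end

  { "key" => key, "value" => value }
end

parse_annotation("publication year: 1966") # => {"key" => "publication year", "value" => 1966}
parse_annotation("just a note")            # => {"key" => nil, "value" => "just a note"}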

Note: partway through implementing this I realized it doesn’t really make sense for a task to be considered an annotation; rather, a task is just an item with an extra done field.

To make this easy for consumers to work with I added the type field on items to indicate if it’s a regular item or a task.

Happy to hear any feedback you all have. I’ve got reasonable test coverage now but will likely add tests for more cases soon, as well as a CLI.
