Implementing a CommonMark++ parser

This is an issue to discuss whether / how to implement a CommonMark + directives parser for Sphinx, as @chrisjsewell and I had discussed earlier.

The problem

The recommonmark project piggy-backs on the commonmark-py project to parse markdown. It then defines a Sphinx parser that sub-classes the docutils parser and defines methods that convert the commonmark-py AST into docutils AST (https://github.com/readthedocs/recommonmark/blob/master/recommonmark/parser.py#L21).

Under the hood it’s still using docutils methods since they sub-class the docutils parser, and as a result there is some weird behavior (like nested_parse expecting rST in the content blocks).

One solution

@chrisjsewell proposed writing our own CommonMark -> docutils AST parser, and then adding on the syntax for roles and directives. This would be two things:

A Sphinx parser that reads in markdown, and uses:
Our own Parser that behaves like a docutils parser, but under the hood uses a more modern state-machine library (https://github.com/pytransitions/transitions) to parse markdown.

The hope is that this parser would be easier to maintain, understand, and grow as we wished to support new syntax. It would be a collection of “markdown -> docutils AST” rules, rather than relying on an intermediate AST as the commonmark-py project does.
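To make the idea of "a collection of markdown -> docutils AST rules" driven by a state machine a bit more concrete, here is a minimal sketch in plain Python. It uses neither transitions nor docutils: the Node class is a hypothetical stand-in for a docutils node, the two states ("idle" and "paragraph") are assumptions, and only ATX headings and paragraphs are handled, so this is far from CommonMark-compliant.

```python
import re
from dataclasses import dataclass, field

# Simplified ATX heading rule: 1-6 "#" characters, whitespace, then the title.
HEADING = re.compile(r"(#{1,6})\s+(.*)")


@dataclass
class Node:
    """Hypothetical stand-in for a docutils node (not the docutils API)."""
    kind: str
    text: str = ""
    children: list = field(default_factory=list)


def parse_blocks(source):
    """Tiny line-based state machine that emits nodes directly, with no intermediate AST."""
    document = Node("document")
    state = "idle"  # "idle" between blocks, "paragraph" while collecting lines
    lines = []

    # The extra blank line at the end closes any paragraph that is still open.
    for raw in source.splitlines() + [""]:
        line = raw.strip()
        heading = HEADING.fullmatch(line)
        if state == "paragraph" and (not line or heading):
            # Transition paragraph -> idle: emit the collected paragraph.
            document.children.append(Node("paragraph", " ".join(lines)))
            lines = []
            state = "idle"
        if heading:
            # Rule: a heading line becomes a title node.
            document.children.append(Node("title", heading.group(2)))
        elif line:
            # Transition idle -> paragraph, or stay in paragraph.
            lines.append(line)
            state = "paragraph"
    return document


doc = parse_blocks("# Title\n\nFirst line\nsecond line\n\nAnother paragraph")
print([(child.kind, child.text) for child in doc.children])
```

Each branch in the loop is one "rule", so supporting a new block type would mean adding a state or a branch rather than touching an intermediate AST.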

A question - could we continue using `commonmark-py`?

As I was looking through the documentation, I started wondering whether we could still use the commonmark-py machinery to parse basic CommonMark syntax, and then use our own state-machine parser to handle the “extra” grammar elements like roles and directives.

Basically, I’m wondering whether we could do the same thing that recommonmark does, but instead of sub-classing a docutils Parser, we sub-class a parser that knows how to parse only the subset of markdown that commonmark doesn’t cover.

If this were possible, I feel like we wouldn’t need to worry about re-writing the test suite of commonmark-py, and we could then focus only on the extra syntax needed for things like roles and directives. We could then also have a markdown parser under the hood for the nested_parse sections.
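As a rough sketch of what "focusing only on the extra syntax" could mean, the snippet below runs a second pass over a piece of text that a CommonMark parser has already handled and picks out role-like spans. The issue does not define a role syntax, so the {name}`content` form used here is purely an assumption for illustration, as are the function name and the tuple format it returns.

```python
import re

# Hypothetical inline role syntax, assumed for illustration only: {name}`content`
ROLE = re.compile(r"\{(?P<name>[\w:-]+)\}`(?P<content>[^`]+)`")


def split_roles(text):
    """Split text into ("text", str) and ("role", name, content) pieces."""
    pieces = []
    position = 0
    for match in ROLE.finditer(text):
        if match.start() > position:
            # Plain text between the previous role (or the start) and this one.
            pieces.append(("text", text[position:match.start()]))
        pieces.append(("role", match.group("name"), match.group("content")))
        position = match.end()
    if position < len(text):
        # Trailing plain text after the last role.
        pieces.append(("text", text[position:]))
    return pieces


print(split_roles("See {ref}`my-label` for details."))
```

In a real parser the "role" pieces would be dispatched to the matching Sphinx role, while the "text" pieces would stay with whatever the base markdown parser produced.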

Note - it may also be illustrative to look at how the commonmark-py parser does its parsing - I believe that code starts here: https://github.com/readthedocs/commonmark.py/blob/c4c5b0df72961663060c65ed0858840b5e031b10/commonmark/blocks.py#L881

And the blocks.py module in general defines how they parse markdown…maybe we could re-use (or explicitly use) some of it…

I’m curious what @chrisjsewell thinks about this - mostly I am trying to find ways that we don’t have to write our own from-scratch markdown parser as I’m a bit worried about all the edge-cases we’ll have to consider 😃

Pros and cons of each markdown reader in Python

Find it here: https://github.com/ExecutableBookProject/meta/wiki/Resources:-Markdown-(MD)#markdown-parsers-in-python

Top GitHub Comments

chrisjsewell commented, Feb 10, 2020

I’m dead set on mistletoe at this point. Yes, the maintenance is a con, but I think it is just so much more well structured and flexible than the other options.

In https://github.com/chrisjsewell/mistletoe/blob/myst/test.ipynb, you’ll see I have forked it, and with just a few code additions, I have implemented the majority of the markdown -> docutils bridge already, including capturing block token source code line ranges.

If you are in the office soon we can discuss?

choldgraf commented, Feb 10, 2020

Another option I hadn’t considered, but which might be attractive, especially if we don’t think any of these projects will work for us, is to use a Python wrapper for the cmark package. cmark is written in C and is super fast and well-maintained.

For one example, here’s a lightweight Python wrapper for this package:

https://pypi.org/project/paka.cmark/

It can be used like:

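from paka import cmark
import xml.etree.ElementTree as ET

def parse_md_to_xml(md):
    # cmark.to_xml returns the parsed document serialized as an XML string
    xml = cmark.to_xml(md)
    # Turn the XML string into an ElementTree element we can walk in Python
    tree = ET.fromstring(xml)
    return tree

# Example input; any markdown string works here
md = "# Title\n\nSome *emphasized* text."
tree = parse_md_to_xml(md)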

(Note that this wrapper library, at least, doesn’t output an AST, but it does output XML, which we can parse into Python. We’d then need another pass to look for the extra syntax we’d want to support, since extensions are not natively supported, and in any case they’d need to be written in C.)
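The extra pass over the XML mentioned in the note could look roughly like the sketch below. To keep it runnable without paka.cmark installed, it works on a hand-written XML sample shaped like cmark's XML output (an assumption; real output may carry additional attributes). Treating a fenced code block whose info string is wrapped in braces as a directive is also a hypothetical convention, not something this thread has settled on.

```python
import xml.etree.ElementTree as ET

# Hand-written sample resembling cmark's XML output (an assumption, see above).
SAMPLE_XML = """<document xmlns="http://commonmark.org/xml/1.0">
  <paragraph><text>Intro text.</text></paragraph>
  <code_block info="{note}">Remember this.
</code_block>
  <code_block info="python">print("hi")
</code_block>
</document>"""


def local_name(element):
    # ElementTree stores tags as "{namespace}name"; keep only the name part.
    return element.tag.rsplit("}", 1)[-1]


def find_directives(root):
    """Return (name, body) for code blocks whose info string looks like {name}."""
    found = []
    for element in root.iter():
        info = element.get("info", "")
        is_code_block = local_name(element) == "code_block"
        if is_code_block and info.startswith("{") and info.endswith("}"):
            found.append((info[1:-1], element.text or ""))
    return found


root = ET.fromstring(SAMPLE_XML)
print(find_directives(root))
```

The ordinary python code block is left alone, so only blocks that opt in through the brace convention would be handed to a directive.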

The cmark-based approach is also about 2-3 times faster than mistune…

There is also commonmark.js, which is the JavaScript implementation of the CommonMark spec. Perhaps that could be utilized somehow as well?

It is worth noting that both cmark and commonmark.js are very well-maintained, given that they are the official reference implementations of CommonMark.
