Implementing a CommonMark++ parser
This is an issue to discuss whether / how to implement a CommonMark + directives parser for Sphinx, as @chrisjsewell and I had discussed earlier.
The problem
The recommonmark project piggy-backs on the commonmark-py project to parse markdown. It then defines a Sphinx parser that sub-classes the docutils parser and defines methods that convert the commonmark-py AST into docutils AST (https://github.com/readthedocs/recommonmark/blob/master/recommonmark/parser.py#L21).
Under the hood it’s still using docutils methods since they sub-class the docutils parser, and as a result there is some weird behavior (like nested_parse expecting rST in the content blocks).
One solution
@chrisjsewell proposed writing our own CommonMark -> docutils AST parser, and then adding on the syntax for roles and directives. This would be two things:
- A Sphinx parser that reads in markdown, and uses:
- Our own parser that behaves like a docutils parser, but under the hood uses a more modern state-machine library (https://github.com/pytransitions/transitions) to parse markdown.
The hope is that this parser would be easier to maintain, understand, and grow as we wished to support new syntax. It would be a collection of “markdown -> docutils AST” rules, rather than relying on an intermediate AST as the commonmark-py project does.
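To make the state-machine idea a bit more concrete, here is a rough sketch in plain Python. It is only an illustration: it does not use the transitions library, the `Block` class is a hypothetical stand-in for real docutils nodes, and the fenced `{name}` opener for directives is an assumed syntax that this issue has not settled on.

```python
from dataclasses import dataclass, field
from typing import List, Optional

FENCE = "`" * 3  # three backticks, built this way to keep the example readable


@dataclass
class Block:
    """Hypothetical stand-in for a docutils node."""
    kind: str                      # "paragraph" or "directive"
    lines: List[str] = field(default_factory=list)
    name: str = ""                 # directive name, empty for paragraphs
    start_line: int = 0            # 1-based source line where the block starts


def parse_blocks(text: str) -> List[Block]:
    """Split text into paragraph and directive blocks with a two-state machine."""
    blocks: List[Block] = []
    state = "body"                 # the two states are "body" and "directive"
    current: Optional[Block] = None

    for lineno, line in enumerate(text.splitlines(), start=1):
        stripped = line.rstrip()
        if state == "body":
            if stripped.startswith(FENCE + "{") and stripped.endswith("}"):
                # Assumed directive opener: close any open paragraph, switch state.
                if current is not None:
                    blocks.append(current)
                name = stripped[len(FENCE) + 1:-1]
                current = Block("directive", name=name, start_line=lineno)
                state = "directive"
            elif not stripped:
                # A blank line ends the current paragraph, if there is one.
                if current is not None:
                    blocks.append(current)
                    current = None
            else:
                if current is None:
                    current = Block("paragraph", start_line=lineno)
                current.lines.append(line)
        else:
            if stripped == FENCE:
                # Closing fence: emit the directive and go back to "body".
                blocks.append(current)
                current = None
                state = "body"
            else:
                current.lines.append(line)

    if current is not None:
        blocks.append(current)     # flush whatever is still open at end of input
    return blocks
```

Each branch is one “markdown -> node” rule, and the block keeps its starting source line, which is the sort of information Sphinx needs for error reporting. A real version would need many more states to cover the CommonMark edge cases.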
A question - could we continue using commonmark-py?
As I was looking through the documentation, I started wondering whether we could still use the commonmark-py machinery to parse basic commonmark syntax, and then use our own state-machine parser to handle the “extra” grammar elements like roles and directives.
Basically, I’m wondering whether we could do the same thing that recommonmark does, but instead of sub-classing a docutils Parser, we sub-class a parser that knows how to parse only the subset of markdown that commonmark doesn’t cover.
If this were possible, I feel like we wouldn’t need to worry about re-writing the test suite of commonmark-py, and we could then focus only on the extra syntax needed for things like roles and directives. We could then also have a markdown parser under the hood for the nested_parse sections.
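As a very rough sketch of what “only handle the extra syntax ourselves” might look like for inline roles: the function below pulls role tokens out of a string and leaves everything else as plain text that could be handed to an existing CommonMark parser. The `{name}` followed by backtick-quoted content syntax and the `split_roles` helper are both assumptions made for illustration, not something commonmark-py or recommonmark provides.

```python
import re
from typing import List, Tuple

# Assumed role syntax: {name}`content`
ROLE_RE = re.compile(r"\{([a-zA-Z][\w:-]*)\}`([^`]+)`")


def split_roles(text: str) -> List[Tuple[str, ...]]:
    """Return ("text", str) and ("role", name, content) tokens in source order."""
    tokens: List[Tuple[str, ...]] = []
    pos = 0
    for match in ROLE_RE.finditer(text):
        if match.start() > pos:
            # Plain markdown between roles, left for the CommonMark parser.
            tokens.append(("text", text[pos:match.start()]))
        tokens.append(("role", match.group(1), match.group(2)))
        pos = match.end()
    if pos < len(text):
        tokens.append(("text", text[pos:]))
    return tokens


tokens = split_roles("See {ref}`my-label` for details.")
```

A regex pass like this ignores context such as code spans and escapes, which is exactly the kind of edge case that makes reusing a tested parser attractive.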
Note - it may also be illustrative to look at how the commonmark-py parser does its parsing - I believe that code starts here: https://github.com/readthedocs/commonmark.py/blob/c4c5b0df72961663060c65ed0858840b5e031b10/commonmark/blocks.py#L881
And the blocks.py module in general defines how they parse markdown…maybe we could re-use (or explicitly use) some of it…
I’m curious what @chrisjsewell thinks about this - mostly I am trying to find ways that we don’t have to write our own from-scratch markdown parser as I’m a bit worried about all the edge-cases we’ll have to consider 😃
Pros and cons of each markdown reader in Python
Find it here: https://github.com/ExecutableBookProject/meta/wiki/Resources:-Markdown-(MD)#markdown-parsers-in-python
Top GitHub Comments
I’m dead set on mistletoe at this point. Yes, the maintenance is a con, but I think it is just so much more well structured and flexible than the other options.
In https://github.com/chrisjsewell/mistletoe/blob/myst/test.ipynb, you’ll see I have forked it, and with just a few code additions, I have implemented the majority of the markdown -> docutils bridge already, including capturing block token source code line ranges.
If you are in the office soon we can discuss?
Another option I hadn’t considered, but might be attractive especially if we don’t think any of these projects will work for us, is to use a Python wrapper to the cmark package. cmark is written in C and is super fast and well-maintained. For one example, here’s a lightweight Python wrapper for this package:
https://pypi.org/project/paka.cmark/
It can be used like:
(note that this wrapper library, at least, doesn’t output an AST, but it does output XML which we can parse into Python…we’d then need to do another pass to look for the extra syntax we’d want to support, since extensions are not natively supported (and anyway they’d need to be written in C))
This is about 2-3 times faster than mistune as well…
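To illustrate the “parse the XML into Python” step, here is a small sketch using only the standard library. It does not call the wrapper itself: `SAMPLE_XML` is a hand-written, simplified approximation of the XML that cmark emits (the real output carries extra attributes), and the namespace URI is written from memory, so treat both as assumptions.

```python
import xml.etree.ElementTree as ET

# Namespace assumed for cmark's XML output.
NS = "{http://commonmark.org/xml/1.0}"

# Hand-written sample that approximates the XML for "# Title\n\nHello *world*".
SAMPLE_XML = """<document xmlns="http://commonmark.org/xml/1.0">
  <heading level="1">
    <text>Title</text>
  </heading>
  <paragraph>
    <text>Hello </text>
    <emph>
      <text>world</text>
    </emph>
  </paragraph>
</document>"""


def walk(element, depth=0):
    """Yield (depth, node_type, literal) for every node, in document order."""
    node_type = element.tag.replace(NS, "", 1)   # drop the namespace prefix
    literal = element.text if node_type == "text" else None
    yield depth, node_type, literal
    for child in element:
        yield from walk(child, depth + 1)


root = ET.fromstring(SAMPLE_XML)
nodes = list(walk(root))
```

The second pass for roles and directives would then run over the `text` nodes, which is where any syntax that cmark does not know about ends up.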
There is also commonmark.js which is the javascript implementation of the commonmark spec. Perhaps that could be utilized somehow as well?
Worth noting is that both cmark and commonmark.js are very well-maintained, given that they are the official reference implementations of commonmark.