Help with design of an XML-based parser
Hi,
TL;DR: help me figure out how to parse XML using Lark.
I’m looking to write a parser to convert DocBook to Sphinx via an AST API called `docutils.nodes`. DocBook is an XML-based format for writing documentation. Sphinx uses docutils to create beautiful and functional documentation. The Linux kernel docs used to be in DocBook but have now moved completely to Sphinx. I’m looking to convert the documentation for a healthcare standard called DICOM, which is in DocBook, to Sphinx.
Sphinx originally supported only reStructuredText (RST). Recently, the Recommonmark project allowed Sphinx to use Markdown for docs (this is also relevant to #640). I see parser code which converts the Markdown AST to the `docutils.nodes` AST. They use `self.current_node`, `self.visit_x()` and `depart_x` (x = node type) to convert the ASTs.
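For readers unfamiliar with the visit/depart style, here is a minimal sketch of the pattern using only the standard library. It is not docutils or Recommonmark code: the `Node` and `DocBuilder` classes, the `_push`/`_pop` helpers, and the `section`/`para` tag names are all hypothetical stand-ins chosen for illustration.

```python
import xml.etree.ElementTree as ET


class Node:
    """Hypothetical stand-in for a docutils node: a name, text and children."""

    def __init__(self, name, text=""):
        self.name = name
        self.text = text
        self.children = []


class DocBuilder:
    """Walks an XML tree, calling visit_<tag> / depart_<tag> when defined."""

    def __init__(self):
        self.root = Node("document")
        self.current_node = self.root  # where new nodes get attached
        self._stack = []               # parents to restore on depart

    def walk(self, elem):
        getattr(self, "visit_" + elem.tag, self.default_visit)(elem)
        for child in elem:
            self.walk(child)
        getattr(self, "depart_" + elem.tag, self.default_depart)(elem)

    def _push(self, node):
        self.current_node.children.append(node)
        self._stack.append(self.current_node)
        self.current_node = node

    def _pop(self):
        self.current_node = self._stack.pop()

    def default_visit(self, elem):
        pass  # unknown tags are skipped; their children are still walked

    def default_depart(self, elem):
        pass

    def visit_section(self, elem):
        self._push(Node("section"))

    def depart_section(self, elem):
        self._pop()

    def visit_para(self, elem):
        self._push(Node("paragraph", (elem.text or "").strip()))

    def depart_para(self, elem):
        self._pop()


builder = DocBuilder()
builder.walk(ET.fromstring("<section><para>Hello</para><para>World</para></section>"))
section = builder.root.children[0]
print(section.name, [(c.name, c.text) for c in section.children])
```

The bookkeeping around `current_node` (push on visit, pop on depart) is the part that a bottom-up, Transformer-style design would make unnecessary.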
I think this code could be written a lot more elegantly using the Tree and Transformer API of Lark. I want to take this approach as I convert XML-based DocBook to the `docutils.nodes` API. Since this is XML, I wouldn’t be using Lark's lexer/parser and would use Python’s ElementTree API instead. Documentation on using a custom lexer with Lark is understandably scant.
Could you help me figure out a design for my project with Lark, or tell me whether I should use Lark at all? I’d be very grateful.
If this DocBook project is successful, I think Lark could probably be used to modernize the docutils parser API and allow conversion of any document format to Sphinx. After all, Pandoc, the Swiss Army knife of document formats, follows a similar architecture to Lark.
Thanks for reading through the wall of text 😃
Issue Analytics
- Created 3 years ago
- Comments: 5 (5 by maintainers)
Top GitHub Comments
It looks pretty simple to me.
See this code:
Here’s what I get when I run it:
Pretty much the same structure.
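As a rough stand-in for the kind of code being referred to (not the maintainer's original snippet), here is a minimal sketch that parses a small XML fragment with Python's `xml.etree.ElementTree` and lists its structure. The sample document and the `describe` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical DocBook-like fragment, made up for illustration.
SAMPLE = """<article id="a1">
  <title>Example</title>
  <para>Some <emphasis>important</emphasis> text.</para>
</article>"""


def describe(elem, depth=0):
    """Return indented lines showing each element's tag, attrib and text."""
    text = (elem.text or "").strip()
    lines = ["  " * depth + f"{elem.tag} attrib={elem.attrib} text={text!r}"]
    for child in elem:  # iterating an Element yields its direct children
        lines.extend(describe(child, depth + 1))
    return lines


root = ET.fromstring(SAMPLE)
print("\n".join(describe(root)))
```

Each element carries a tag, a list of children, an `attrib` dict and a `text` string, so it already looks like a parse tree. Note that text following a child element (here " text." after `emphasis`) lives in that child's `tail` attribute, which this sketch does not print.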
I bet I could write a function that converts this to Lark’s `Tree` in less than 20 lines. Or you could re-implement the Transformer logic yourself. I think the latter makes more sense, because `Tree` doesn’t have the `attrib` / `text` distinction.
But it’s pretty simple, no? You can use the code in `lark/visitors.py` to get you started.
Thanks a lot for this. I’ll look at `lark/visitors.py`. I would prefer using existing libraries to writing my own code. 😃 The `attrib` / `text` distinction is not necessary for my project.
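For anyone weighing the "re-implement the Transformer logic yourself" option, here is a minimal sketch of what that could look like directly on top of ElementTree, with no Lark dependency. It only imitates the general idea of a bottom-up transformer that dispatches on a node's name; it is not the code in `lark/visitors.py`, and the `ElementTransformer` and `ToOutline` classes, the callback signature, and the tag names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


class ElementTransformer:
    """Bottom-up transformer over ElementTree, loosely modeled on a Lark Transformer."""

    def transform(self, elem):
        # Children are transformed first, so each callback sees finished results.
        children = [self.transform(child) for child in elem]
        callback = getattr(self, elem.tag, self.__default__)
        return callback(elem, children)

    def __default__(self, elem, children):
        # Fallback for tags without a dedicated method: keep a plain tuple.
        return (elem.tag, (elem.text or "").strip(), children)


class ToOutline(ElementTransformer):
    """Hypothetical example: turn a tiny DocBook-like tree into nested dicts."""

    def title(self, elem, children):
        return {"type": "title", "text": (elem.text or "").strip()}

    def para(self, elem, children):
        return {"type": "paragraph", "text": (elem.text or "").strip()}

    def section(self, elem, children):
        return {"type": "section", "children": children}


doc = ET.fromstring(
    "<section><title>Intro</title><para>Hello</para><note>n</note></section>"
)
result = ToOutline().transform(doc)
print(result)
```

Because each callback receives the original element as well as the transformed children, `attrib` and `text` stay available when they are needed. One caveat of dispatching by tag name: a tag called `transform` would collide with the method of the same name, so a real implementation would probably want a prefix or an explicit mapping.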