Help with design of an XML-based parser

Hi,

TLDR: help me figure out how to parse XMLs using Lark.

I’m looking to write a parser that converts DocBook to Sphinx via the AST API called docutils.nodes. DocBook is an XML-based format for writing documentation. Sphinx uses docutils to create beautiful and functional documentation. The Linux kernel docs used to be in DocBook, but have now completely moved to Sphinx. I’m looking to convert the documentation for a healthcare standard called DICOM, which is in DocBook, to Sphinx.

Sphinx originally supported reStructuredText (RST) alone. Recently, the Recommonmark project allowed Sphinx to use Markdown for docs (this is also relevant to #640). I see parser code there which converts the Markdown AST to the docutils.nodes AST. It uses self.current_node, self.visit_x() and self.depart_x() (x = node type) to convert between the ASTs.
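For readers unfamiliar with the visit/depart style mentioned above, here is a minimal sketch of the pattern. It is not Recommonmark's actual code: the class name `VisitDepartWalker`, the `walk` method, and the `events` list are hypothetical names chosen for illustration, and the input is an ElementTree element rather than a Markdown AST.

```python
import xml.etree.ElementTree as ET


class VisitDepartWalker:
    """Hypothetical walker: calls visit_<tag> on the way down, depart_<tag> on the way up."""

    def __init__(self):
        self.events = []  # record of callbacks, kept only for illustration

    def walk(self, node):
        tag = node.tag.split('}')[-1]  # drop any '{namespace}' prefix
        getattr(self, 'visit_' + tag, self.default_visit)(node)
        for child in node:
            self.walk(child)
        getattr(self, 'depart_' + tag, self.default_depart)(node)

    def default_visit(self, node):
        pass  # tags without a visit_<tag> method are ignored

    def default_depart(self, node):
        pass  # tags without a depart_<tag> method are ignored

    def visit_para(self, node):
        self.events.append('enter para')

    def depart_para(self, node):
        self.events.append('leave para')

    def visit_emphasis(self, node):
        self.events.append('enter emphasis')

    def depart_emphasis(self, node):
        self.events.append('leave emphasis')


walker = VisitDepartWalker()
walker.walk(ET.fromstring('<para>Hello <emphasis>world</emphasis>!</para>'))
print(walker.events)
# ['enter para', 'enter emphasis', 'leave emphasis', 'leave para']
```

In a real converter the visit methods would typically create an output node and make it the current node, and the depart methods would step back up to the parent.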

I think this code could be written a lot more elegantly using the Tree and Transformer APIs of Lark. I want to take this approach as I convert XML-based DocBook to the docutils.nodes API. Since this is XML, I wouldn’t be using Lark’s lexer/parser and would use Python’s ElementTree API instead. Documentation on using a custom lexer with Lark is understandably scant.

Would it be possible for you to help me figure out a design for my project with Lark, or whether I should use Lark at all? I’d be very grateful.

If this DocBook project is successful, I think Lark could probably be used to modernize the docutils parser API and allow conversion of any document format to Sphinx. After all, Pandoc, the Swiss Army knife of document formats, follows a similar architecture to Lark.

Thanks for reading through the wall of text 😃

Issue Analytics

Created 3 years ago
Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction

erezsh commented, Aug 8, 2020

It looks pretty simple to me.

See this code:

s = """
 <?xml version="1.0" encoding="UTF-8"?>
 <book xml:id="simple_book" xmlns="http://docbook.org/ns/docbook" version="5.0">
   <title>Very simple book</title>
   <chapter xml:id="chapter_1">
     <title>Chapter 1</title>
     <para>Hello world!</para>
     <para>I hope that your day is proceeding <emphasis>splendidly</emphasis>!</para>
   </chapter>
   <chapter xml:id="chapter_2">
     <title>Chapter 2</title>
     <para>Hello again, world!</para>
   </chapter>
 </book>
 """

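import xml.etree.ElementTree as ET

# strip() removes the leading whitespace so the XML declaration starts the string
tree = ET.fromstring(s.strip())


def get_str(node):
    # Drop the '{namespace}' prefix from the tag, then append the attribute dict
    return node.tag.split('}')[-1] + str(node.attrib)

def _pretty(node, level, indent_str):
    # Note: iterating an Element yields only child Elements (text lives in
    # node.text / node.tail), so the non-Element branches below never run here.
    if len(list(node)) == 1 and not isinstance(list(node)[0], ET.Element):
        return [ indent_str*level, get_str(node), '\t', '%s' % (list(node)[0],), '\n']

    l = [ indent_str*level, get_str(node), '\n' ]
    for n in list(node):
        if isinstance(n, ET.Element):
            l += _pretty(n, level+1, indent_str)
        else:
            l += [ indent_str*(level+1), '%s' % (n,), '\n' ]

    return l

def pretty(node, indent_str='  '):
    # Join the pieces into one indented outline, one line per element
    return ''.join(_pretty(node, 0, indent_str))

print(pretty(tree))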

Here’s what I get when I run it:

PS C:\code\lark> python -i .\issue641.py
book{'{http://www.w3.org/XML/1998/namespace}id': 'simple_book', 'version': '5.0'}
  title{}
  chapter{'{http://www.w3.org/XML/1998/namespace}id': 'chapter_1'}
    title{}
    para{}
    para{}
      emphasis{}
  chapter{'{http://www.w3.org/XML/1998/namespace}id': 'chapter_2'}
    title{}
    para{}

Pretty much the same structure.

I bet I could write a function that converts this to Lark’s Tree in less than 20 lines. Or you could re-implement the Transformer logic yourself. I think the latter makes more sense, because Tree doesn’t have the attrib / text distinction.

But it’s pretty simple, no? You can use the code in lark/visitors.py to get you started.
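As an illustration of re-implementing the Transformer logic directly on ElementTree, here is a minimal bottom-up sketch using only the standard library. The names `ElementTransformer`, `Outline`, and the callback signature `(node, children)` are assumptions made for this example, not Lark's API; the idea borrowed from Lark is only that children are processed first and a method named after the node type receives the results.

```python
import xml.etree.ElementTree as ET


class ElementTransformer:
    """Hypothetical bottom-up transformer over ElementTree nodes.

    Children are transformed first; then the method named after the tag
    (namespace stripped) is called with the element and the child results.
    """

    def transform(self, node):
        children = [self.transform(child) for child in node]
        tag = node.tag.split('}')[-1]
        callback = getattr(self, tag, self.__default__)
        return callback(node, children)

    def __default__(self, node, children):
        # Fallback: keep the tag name and the already-transformed children
        return (node.tag.split('}')[-1], children)


class Outline(ElementTransformer):
    """Example subclass: collapse a book into {book title: [chapter titles]}."""

    def title(self, node, children):
        return node.text

    def chapter(self, node, children):
        # The chapter's <title> child has already become a plain string
        return next(c for c in children if isinstance(c, str))

    def book(self, node, children):
        return {children[0]: children[1:]}


doc = ET.fromstring(
    '<book xmlns="http://docbook.org/ns/docbook">'
    '<title>Very simple book</title>'
    '<chapter><title>Chapter 1</title><para>Hello world!</para></chapter>'
    '<chapter><title>Chapter 2</title><para>Hello again, world!</para></chapter>'
    '</book>'
)
result = Outline().transform(doc)
print(result)
# {'Very simple book': ['Chapter 1', 'Chapter 2']}
```

Because each callback receives the original element, attributes (`node.attrib`) and text (`node.text`) stay available, which sidesteps the attrib / text mismatch with Tree. One caveat of dispatching on bare tag names is that a tag called `transform` would collide with the method of the same name; a prefix would avoid that.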

0 reactions

chsasank commented, Aug 8, 2020

Thanks a lot for this. I’ll look at lark/visitors.py. I’d prefer using existing libraries over writing my own code. 😃 The attrib / text distinction is not necessary for my project.