Help with design of an XML-based parser

Hi,

TL;DR: help me figure out how to parse XML using Lark.

I’m looking to write a parser that converts DocBook to Sphinx via the AST API docutils.nodes. DocBook is an XML-based format for writing documentation. Sphinx uses docutils to create beautiful and functional documentation. The Linux kernel docs used to be in DocBook but have now moved completely to Sphinx. I want to convert the documentation for the healthcare standard DICOM, which is written in DocBook, to Sphinx.

Sphinx originally supported reStructuredText (RST) alone. Recently, the Recommonmark project allowed Sphinx to use Markdown for docs (this is also relevant to #640). I see parser code there that converts the Markdown AST to the docutils.nodes AST. It uses self.current_node, self.visit_x() and depart_x (x = node type) to convert the ASTs.

I think this code could be written a lot more elegantly using Lark’s Tree and Transformer APIs. I want to take this approach as I convert XML-based DocBook to the docutils.nodes API. Since the input is XML, I wouldn’t be using Lark’s lexer/parser and would use Python’s ElementTree API instead. Documentation on using a custom lexer with Lark is understandably scant.

Would it be possible for you to help me figure out a design for my project with Lark, or whether I should use Lark at all? I’d be very grateful.

If this DocBook project is successful, I think Lark could probably be used to modernize the docutils parser API and allow conversion of any document format to Sphinx. After all, Pandoc, the Swiss Army knife of document formats, follows a similar architecture to Lark.

Thanks for reading through the wall of text 😃

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

erezsh commented, Aug 8, 2020

It looks pretty simple to me.

See this code:

s = """
 <?xml version="1.0" encoding="UTF-8"?>
 <book xml:id="simple_book" xmlns="http://docbook.org/ns/docbook" version="5.0">
   <title>Very simple book</title>
   <chapter xml:id="chapter_1">
     <title>Chapter 1</title>
     <para>Hello world!</para>
     <para>I hope that your day is proceeding <emphasis>splendidly</emphasis>!</para>
   </chapter>
   <chapter xml:id="chapter_2">
     <title>Chapter 2</title>
     <para>Hello again, world!</para>
   </chapter>
 </book>
 """

import xml.etree.ElementTree as ET

tree = ET.fromstring(s.strip())


def get_str(node):
    # Strip the '{namespace}' prefix from the tag, then append the attribute dict.
    return node.tag.split('}')[-1] + str(node.attrib)

def _pretty(node, level, indent_str):
    # Iterating an Element yields only child Elements (text lives in
    # .text/.tail), so every child can be handled by plain recursion.
    l = [ indent_str*level, get_str(node), '\n' ]
    for n in node:
        l += _pretty(n, level+1, indent_str)

    return l

def pretty(node, indent_str='  '):
    return ''.join(_pretty(node, 0, indent_str))

print(pretty(tree))

Here’s what I get when I run it:

PS C:\code\lark> python -i .\issue641.py
book{'{http://www.w3.org/XML/1998/namespace}id': 'simple_book', 'version': '5.0'}
  title{}
  chapter{'{http://www.w3.org/XML/1998/namespace}id': 'chapter_1'}
    title{}
    para{}
    para{}
      emphasis{}
  chapter{'{http://www.w3.org/XML/1998/namespace}id': 'chapter_2'}
    title{}
    para{}

Pretty much the same structure.

I bet I could write a function that converts this to Lark’s Tree in less than 20 lines. Or you could re-implement the Transformer logic yourself. I think the latter makes more sense, because Tree doesn’t have the attrib / text distinction.

But it’s pretty simple, no? You can use the code in lark/visitors.py to get you started.
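
For the second option, a minimal sketch of Transformer-style logic written directly against ElementTree might look like the following. This is not Lark's implementation, only the same bottom-up idea: children are transformed first, then the results are handed to a method named after the element's tag. The names `ElementTransformer`, `inline_text` and `DocbookSketch` are hypothetical, and the output shape (nested tuples, `*...*` for emphasis) is just an assumption for illustration; a real converter would build docutils.nodes objects instead.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.split('}')[-1]


def inline_text(elem, children):
    """Join an element's own text with its transformed children and their tails.

    Assumes every transformed child is already a string.
    """
    parts = [elem.text or '']
    for child, result in zip(elem, children):
        parts.append(result)
        parts.append(child.tail or '')
    return ''.join(parts)


class ElementTransformer:
    """Bottom-up transformer over ElementTree elements (hypothetical helper).

    Each element's children are transformed first. The results are passed to
    a method named after the element's local tag, or to __default__ if the
    subclass defines no such method.
    """

    def transform(self, elem):
        children = [self.transform(child) for child in elem]
        callback = getattr(self, local_name(elem.tag), self.__default__)
        return callback(elem, children)

    def __default__(self, elem, children):
        return (local_name(elem.tag), children)


class DocbookSketch(ElementTransformer):
    """Handles a few DocBook tags; everything else falls through to __default__."""

    def title(self, elem, children):
        return ('title', inline_text(elem, children))

    def emphasis(self, elem, children):
        return '*' + inline_text(elem, children) + '*'

    def para(self, elem, children):
        return ('para', inline_text(elem, children))


doc = ET.fromstring(
    '<book xmlns="http://docbook.org/ns/docbook">'
    '<chapter><title>Chapter 1</title>'
    '<para>Have a <emphasis>splendid</emphasis> day!</para>'
    '</chapter></book>'
)
result = DocbookSketch().transform(doc)
print(result)
```

Dispatching on the tag name with `getattr` keeps the sketch short, but it only works for tags that are valid Python identifiers and that do not collide with the class's own method names (such as `transform`); a dictionary from tag to handler would avoid both limits.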

chsasank commented, Aug 8, 2020

Thanks a lot for this. I’ll look at lark/visitors.py. I’d prefer using existing libraries over writing my own code. 😃 The attrib / text distinction is not necessary for my project.
