
Performance of the library

I am currently testing the issue-469 branch, which no longer requires any manual changes, which is great. I am parsing this file: http://data.ndovloket.nl/netex/htm/NeTEx_HTM__2020-10-12.xml.gz

import time
import gzip

from xsdata.formats.dataclass.context import XmlContext
from xsdata.formats.dataclass.parsers import XmlParser
from xsdata.formats.dataclass.parsers.config import ParserConfig

config = ParserConfig(
    process_xinclude=False,
    fail_on_unknown_properties=False,
)

print("Before import", time.time())

# Importing the generated bindings is measured separately.
from netex import PublicationDelivery

print("Before parser", time.time())

parser = XmlParser(context=XmlContext(), config=config)
pd = parser.parse(gzip.open("/var/tmp/NeTEx_HTM__2020-10-12.xml.gz", "r"), PublicationDelivery)

print("After parser", time.time())

# Build an id -> distance lookup from the bound object tree.
timing_links = {}
for timing_link in pd.data_objects.composite_frame[0].frames.service_frame[0].timing_links.timing_link:
    timing_links[timing_link.id] = timing_link.distance

print("After dict", time.time())

print(timing_links)

Before import 1619954488.1244667
Before parser 1619954492.6376452  (4s)
After parser  1619954562.600524   (70s)
After dict    1619954562.601241

Compare this with the snippet below, which completes within one second. I agree the two are not directly comparable, but maybe there is a way to deserialise the file just in time.

import gzip
from lxml import etree
etree.parse(gzip.open('/var/tmp/NeTEx_HTM__2020-10-12.xml.gz', 'r'))
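
For contrast, here is a minimal sketch of one possible just-in-time approach: stream only the TimingLink elements with lxml's iterparse instead of binding the whole tree. The NeTEx namespace URI and the Distance child element are assumptions based on the document above, not xsdata functionality.

import gzip
from lxml import etree

NS = "{http://www.netex.org.uk/netex}"  # assumed NeTEx namespace

timing_links = {}
with gzip.open("/var/tmp/NeTEx_HTM__2020-10-12.xml.gz", "rb") as fh:
    # Only TimingLink end-events are materialised; everything else streams past.
    for _event, element in etree.iterparse(fh, tag=NS + "TimingLink"):
        timing_links[element.get("id")] = element.findtext(NS + "Distance")
        element.clear()  # release the subtree we just consumed

print(len(timing_links))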

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

2 reactions
skinkie commented, May 2, 2021

You should understand that you have likely (already) come up with the best XSD tool for Python, and it might be the best implementation after JAXB. We have tested a lot of implementations for different languages, including C#, and virtually all of them, including the commercial ones, fail outright on substitution groups and require “xs:choice” instead. So we are very impressed. You might want to think about generating code for different programming languages from the same generator infrastructure.

There are a few tricks. The obvious one is generating the code with a string formatter: keep a relational database structure, but serialise it to XML by hand. JAXB has an option to export fragments, i.e. without serialising the entire document at once (which consumes large amounts of memory).
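
As a rough illustration of that fragment-export idea (this is not xsdata or JAXB code), lxml's incremental xmlfile writer can emit one fragment at a time, so the full document is never held in memory. The element names and the rows() generator are hypothetical stand-ins for a relational query.

from lxml import etree

def rows():
    # Hypothetical stand-in for a relational query yielding plain tuples.
    for i in range(3):
        yield (f"TL-{i}", str(i * 10))

with open("/tmp/timing_links.xml", "wb") as out:
    with etree.xmlfile(out, encoding="utf-8") as xf:
        with xf.element("timingLinks"):
            for link_id, distance in rows():
                el = etree.Element("TimingLink", id=link_id)
                etree.SubElement(el, "Distance").text = distance
                xf.write(el)  # one fragment at a time, constant memory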

I have written an lxml-based validator which takes a “constraintless” document (i.e. ignoring the identity/key constraints) and validates the structure, and implements the constraint checking in a multithreaded Python way. It still outperforms the “new” libxml2 code, but I think that if libxml2 employed multithreading itself, it could be faster still.
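
A toy sketch of that split, assuming a schema file with the identity/key constraints stripped and id uniqueness as the constraint being checked: libxml2 (via lxml) validates the structure, while the key checks run as Python tasks in a thread pool. Paths and element names are placeholders, and the GIL limits how much true parallelism plain Python threads deliver here.

from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from lxml import etree

doc = etree.parse("document.xml")  # placeholder paths
schema = etree.XMLSchema(etree.parse("schema_without_constraints.xsd"))
assert schema.validate(doc), schema.error_log

def duplicate_ids(tag):
    # Pure-Python uniqueness check standing in for an xs:key constraint.
    ids = doc.xpath(f"//*[local-name() = '{tag}']/@id")
    return tag, [v for v, n in Counter(ids).items() if n > 1]

with ThreadPoolExecutor() as pool:
    for tag, dupes in pool.map(duplicate_ids, ["TimingLink", "ServiceFrame"]):
        if dupes:
            print(f"duplicate {tag} ids: {dupes}")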

This is where the magic happens with libxml2: https://github.com/GNOME/libxml2/commit/faea2fa9b890cc329f33ce518dfa1648e64e14d6

1 reaction
tefra commented, May 2, 2021

I am not aware of any lazy-binding technique in JAXB, but I will do some more research on this. Regarding the partial tree: xsdata uses the lxml/xml iterparse and SAX interfaces to bind data as soon as it is ready, but Java/JAXB performance is out of reach without rewriting a lot of things in C.

I started this project for fun as a side project to stay current with Python. In my experience dealing with XSD, there are many, many, way too many different approaches to accomplish the same thing, and the NeTEx collection is an excellent example of that 😄 The whole schema has 5 or 6 issues that I am working on that I have never encountered before.

Most binding libraries try to cover the most common practices, and there are features in both XSD 1.0 and 1.1 that are simply impossible to implement in some languages.

How are these documents being generated? Gigabytes???

Out of curiosity I tried to validate that 150 MB sample against the schema in Python using lxml, and I gave up after 20 minutes.
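
For reference, a validation along these lines is presumably what was timed; the schema entry-point file name below is an assumption. With a schema collection as large as NeTEx, this is the step that crawls.

import gzip
from lxml import etree

# "NeTEx_publication.xsd" is assumed to be the schema entry point.
schema = etree.XMLSchema(etree.parse("NeTEx_publication.xsd"))
doc = etree.parse(gzip.open("/var/tmp/NeTEx_HTM__2020-10-12.xml.gz", "rb"))
print(schema.validate(doc))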
