Performance of the library
I am currently testing the branch issue-469, which does not require any manual changes, which is great. I am parsing this file: http://data.ndovloket.nl/netex/htm/NeTEx_HTM__2020-10-12.xml.gz
import time
import gzip

from xsdata.formats.dataclass.context import XmlContext
from xsdata.formats.dataclass.parsers import XmlParser
from xsdata.formats.dataclass.parsers.config import ParserConfig

config = ParserConfig(
    process_xinclude=False,
    fail_on_unknown_properties=False,
)

print("Before import", time.time())
from netex import PublicationDelivery
print("Before parser", time.time())

parser = XmlParser(context=XmlContext(), config=config)
pd = parser.parse(gzip.open("/var/tmp/NeTEx_HTM__2020-10-12.xml.gz", "r"), PublicationDelivery)
print("After parser", time.time())

timing_links = {}
for timing_link in pd.data_objects.composite_frame[0].frames.service_frame[0].timing_links.timing_link:
    timing_links[timing_link.id] = timing_link.distance
print("After dict", time.time())
print(timing_links)
Before import 1619954488.1244667
Before parser 1619954492.6376452  (+4 s)
After parser  1619954562.600524   (+70 s)
After dict    1619954562.601241
If I compare this with the snippet below (which completes within 1 second): I agree the two are not directly comparable, but maybe there is a way to deserialise the file just in time.
import gzip
from lxml import etree
etree.parse(gzip.open('/var/tmp/NeTEx_HTM__2020-10-12.xml.gz', 'r'))
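To illustrate the just-in-time idea: a streaming parse can pull out only the timing-link mapping without binding the whole document to dataclasses first. This is a minimal sketch using the stdlib `iterparse`; the `TimingLink`/`Distance` element names and the NeTEx namespace are assumptions based on the schema, not verified against the actual file, and the inline sample stands in for the real document.

```python
# Sketch: build the timing_link id -> distance dict with a streaming parse,
# skipping full dataclass binding. Element and attribute names are assumed
# from the NeTEx schema; adjust to the real document.
import io
import xml.etree.ElementTree as ET

NS = "http://www.netex.org.uk/netex"  # assumed NeTEx namespace

SAMPLE = f"""<PublicationDelivery xmlns="{NS}">
  <TimingLink id="TL:1"><Distance>120</Distance></TimingLink>
  <TimingLink id="TL:2"><Distance>340</Distance></TimingLink>
</PublicationDelivery>"""

def extract_timing_links(source):
    links = {}
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == f"{{{NS}}}TimingLink":
            links[elem.get("id")] = elem.findtext(f"{{{NS}}}Distance")
            elem.clear()  # release the subtree to keep memory flat
    return links

print(extract_timing_links(io.StringIO(SAMPLE)))
```

For the real file, `source` would be `gzip.open(...)` instead of the in-memory sample; the point is that only the elements you care about are ever materialised.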
Issue Analytics
- State:
- Created 2 years ago
- Comments: 8 (4 by maintainers)
You should understand that you have likely come up with the best XSD tool for Python, and it might be the best implementation after JAXB. We have tested a lot of implementations for different languages, including C#, and virtually all of them, including the commercial ones, fail outright on substitution groups and require "xs:choice" instead. So we are very impressed. You might want to think about generating code for different programming languages using the same generator infrastructure.
There are a few tricks. The obvious one is generating the code with a string formatter: keeping a relational database structure, but serializing it to XML by hand. JAXB has an option to export fragments, hence without serializing the entire document at once (which consumes large amounts of memory).
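The fragment-export idea can be sketched without JAXB: serialize one record at a time and write the document envelope by hand, so the full tree never exists in memory. The `Records`/`Record` element names are purely illustrative, not part of any schema.

```python
# Sketch of "export fragments": serialize each record independently and
# stream it out, instead of building and serializing one big tree.
import io
import xml.etree.ElementTree as ET

def write_document(out, records):
    out.write("<Records>")
    for rec_id, value in records:  # could stream from a DB cursor
        frag = ET.Element("Record", id=rec_id)
        frag.text = value
        out.write(ET.tostring(frag, encoding="unicode"))
    out.write("</Records>")

buf = io.StringIO()
write_document(buf, [("1", "a"), ("2", "b")])
print(buf.getvalue())
```

Memory use stays proportional to a single fragment rather than the whole document, which is the point of the JAXB option described above.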
I have written an lxml-based validator which takes a "constraintless" document (hence ignoring identity-key constraints) and validates the structure, and implements the constraint checking in a multithreaded Python way. It still outperforms the "new" libxml2 code, but I think that if libxml2 employed multithreading itself it could be faster still.
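The split described here, structural validation in one pass and identity-key checks done separately across threads, can be sketched with the stdlib alone. The frame names and ids below are illustrative; in the real validator the ids would be harvested during the streaming pass and the structure validated by lxml against a constraint-stripped XSD.

```python
# Sketch: check xs:key uniqueness constraints in parallel, separately from
# structural schema validation. Frame/id names are illustrative only.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def check_unique(ids):
    """Return the ids that violate a uniqueness (xs:key) constraint."""
    return [i for i, n in Counter(ids).items() if n > 1]

frames = {  # ids collected per frame during a single streaming pass
    "service_frame": ["TL:1", "TL:2", "TL:2"],
    "timetable_frame": ["VJ:1", "VJ:2"],
}

with ThreadPoolExecutor() as pool:
    violations = dict(zip(frames, pool.map(check_unique, frames.values())))
print(violations)
```

Each frame's key set is independent, so the checks parallelise cleanly, which is what libxml2's single-threaded constraint evaluation cannot do.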
This is where the magic happens with libxml2: https://github.com/GNOME/libxml2/commit/faea2fa9b890cc329f33ce518dfa1648e64e14d6
I am not aware of any lazy-binding technique in JAXB, but I will do some more research on this. Regarding the partial tree: xsdata is using the lxml/xml iterparse and SAX interfaces to bind data as soon as it is ready, but Java and JAXB performance is out of reach without rewriting a lot of things in C.
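The "bind as soon as it is ready" approach can be illustrated with a plain SAX handler: each object is built the moment its element closes, with no full tree in memory. This is a conceptual sketch, not xsdata's actual implementation, and the `TimingLink` element name is again an assumption.

```python
# Illustration of event-driven binding: objects are produced as elements
# close, rather than after the whole document is parsed.
import io
import xml.sax

class TimingLinkHandler(xml.sax.ContentHandler):
    def __init__(self):
        self.links = []
        self._current = None

    def startElement(self, name, attrs):
        if name == "TimingLink":
            self._current = {"id": attrs.get("id")}

    def endElement(self, name):
        if name == "TimingLink":
            self.links.append(self._current)  # bound as soon as it is ready
            self._current = None

handler = TimingLinkHandler()
doc = '<root><TimingLink id="TL:1"/><TimingLink id="TL:2"/></root>'
xml.sax.parse(io.StringIO(doc), handler)
print(handler.links)
```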
I started this project for fun as a side project to stay current with Python. In my experience dealing with XSD, there are many, way too many, different approaches to accomplish the same thing, and the NeTEx collection is an excellent example of that 😄 The whole schema has 5 or 6 issues that I am working on that I have never encountered before.
Most bindings libraries are trying to cover the most common practices and there are features in both xsd 1.0 and 1.1 that are simply impossible to implement per language.
How are these documents being generated? Gigabytes???
Out of curiosity I tried to validate that 150 MB sample against the schema in Python using lxml, and I gave up after 20 minutes.