performance of input_files on large workspaces
By implementing #635 to properly handle all cases of PAGE-XML file matching per pageId, we have lost sight of the severe performance penalty that this comes with. In effect, we are now nearly as slow as before #482 on workspaces with lots of pages and fileGrps.
Here’s a typical scenario:
- ocrd-cis-ocropy-dewarp creates an image file for each text line, and references it under the pageId and fileGrp to which it belongs, with an image mimetype (this creates 29,000 files for me in a workspace with 500 pages)
- ocrd-tesserocr-recognize runs afterwards and queries its `self.input_files`: https://github.com/OCR-D/core/blob/9069a6581f37ec1c189e8cfaa62692fb66004964/ocrd/ocrd/processor/base.py#L294-L299
- This searches through all `mets:file` entries, matching them for `fileGrp` (which is reasonably fast; it only gets a little inefficient when additionally filtering by `pageId`): https://github.com/OCR-D/core/blob/9069a6581f37ec1c189e8cfaa62692fb66004964/ocrd_models/ocrd_models/ocrd_mets.py#L176-L208
- Then in line 298 (and again further below) it queries `OcrdFile.pageId`: https://github.com/OCR-D/core/blob/9069a6581f37ec1c189e8cfaa62692fb66004964/ocrd_models/ocrd_models/ocrd_file.py#L116-L122
- This in turn needs to repeatedly query the whole structMap via XPath (which on a workspace with 500 files and 25 fileGrps and 200,000 takes about 0.2 sec per file, i.e. needs more than 1h just for the computation of `input_files`): https://github.com/OCR-D/core/blob/9069a6581f37ec1c189e8cfaa62692fb66004964/ocrd_models/ocrd_models/ocrd_mets.py#L434-L441
Cosmetic changes like turning `OcrdFile.pageId` into a `functools.cached_property` won’t help here; the problem is bigger. METS with its mutually related fileGrp and pageId mappings is inherently expensive to parse. I know we have in the past decided against in-memory representations like dicts because that looked like memory leaks or seemed too expensive on very large workspaces. But have we really weighed the cost of that memory-cputime tradeoff carefully yet (also considering the necessity for pageId/mimetype filtering)? Is there any existing code attempting to cache fileGrp and pageId mappings to avoid reparsing the METS again and again, which I could tamper with?
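To make the tradeoff concrete, here is a minimal sketch of what such a dict-based cache could look like. It does not use the OCR-D API: it works on a tiny made-up METS with the standard library's `xml.etree.ElementTree`, and every name in it (`iter_page_divs`, `page_id_uncached`, `build_indexes`, the file and page IDs) is hypothetical. It also assumes that each file is referenced by at most one page. The first function rescans the structMap for every file, which is the pattern described above; the second walks the document once and answers all later questions from plain dicts.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

METS = "{http://www.loc.gov/METS/}"  # METS namespace in ElementTree notation

# Tiny stand-in METS with made-up IDs; a real workspace has thousands of entries.
METS_XML = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:fileSec>
    <mets:fileGrp USE="OCR-D-IMG">
      <mets:file ID="IMG_0001" MIMETYPE="image/png"/>
      <mets:file ID="IMG_0002" MIMETYPE="image/png"/>
    </mets:fileGrp>
    <mets:fileGrp USE="OCR-D-SEG">
      <mets:file ID="SEG_0001" MIMETYPE="application/vnd.prima.page+xml"/>
      <mets:file ID="SEG_0002" MIMETYPE="application/vnd.prima.page+xml"/>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence">
      <mets:div TYPE="page" ID="PHYS_0001">
        <mets:fptr FILEID="IMG_0001"/>
        <mets:fptr FILEID="SEG_0001"/>
      </mets:div>
      <mets:div TYPE="page" ID="PHYS_0002">
        <mets:fptr FILEID="IMG_0002"/>
        <mets:fptr FILEID="SEG_0002"/>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>"""

root = ET.fromstring(METS_XML)


def iter_page_divs(root):
    """Yield every page-level mets:div of the physical structMap."""
    for struct_map in root.iter(METS + "structMap"):
        if struct_map.get("TYPE") != "PHYSICAL":
            continue
        for div in struct_map.iter(METS + "div"):
            if div.get("TYPE") == "page":
                yield div


def page_id_uncached(root, file_id):
    """Per-file lookup: walks the structMap again on every call."""
    for page in iter_page_divs(root):
        for fptr in page.iter(METS + "fptr"):
            if fptr.get("FILEID") == file_id:
                return page.get("ID")
    return None


def build_indexes(root):
    """One pass over structMap and fileSec, returning two plain dicts."""
    page_of_file = {}
    for page in iter_page_divs(root):
        for fptr in page.iter(METS + "fptr"):
            # Assumption: a file belongs to at most one page.
            page_of_file[fptr.get("FILEID")] = page.get("ID")
    files_of_grp = defaultdict(list)
    for grp in root.iter(METS + "fileGrp"):
        for mets_file in grp.findall(METS + "file"):
            files_of_grp[grp.get("USE")].append(mets_file.get("ID"))
    return page_of_file, dict(files_of_grp)


page_of_file, files_of_grp = build_indexes(root)

# Resolve the page IDs for one fileGrp with dict lookups only.
seg_files = [(fid, page_of_file.get(fid)) for fid in files_of_grp["OCR-D-SEG"]]
```

The memory cost of this sketch is roughly one dict entry per `mets:fptr` plus one list entry per `mets:file`. The harder part, which it leaves out entirely, is keeping the dicts in sync whenever files are added to or removed from the METS.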
Top GitHub Comments
Thanks for pointing that out, @bertsky!
@bertsky I have pushed my latest changes to the benchmarking branch. I have not been working on that experiment after that. @mweidling is investigating this topic in more depth and I am available for discussions and support if needed. My personal opinion is that we should try to optimize the OcrdMets functionalities as soon as possible.