Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

performance of input_files on large workspaces

See original GitHub issue

By implementing #635 to properly handle all cases of PAGE-XML file matching per pageId, we have lost sight of the severe performance penalty that this comes with. In effect, we are now nearly as slow as before #482 on workspaces with lots of pages and fileGrps.

Here’s a typical scenario:

ocrd-cis-ocropy-dewarp creates an image file for each text line, and references it under the pageId and fileGrp which it belongs to – under an image mimetype (this creates 29.000 files for me in a workspace with 500 pages)
ocrd-tesserocr-recognize runs afterwards and queries its self.input_files: https://github.com/OCR-D/core/blob/9069a6581f37ec1c189e8cfaa62692fb66004964/ocrd/ocrd/processor/base.py#L294-L299
this searches through all mets:file entries, matching them for fileGrp (which is reasonably fast, it only gets a little inefficient when additionally filtering by pageId): https://github.com/OCR-D/core/blob/9069a6581f37ec1c189e8cfaa62692fb66004964/ocrd_models/ocrd_models/ocrd_mets.py#L176-L208
Then in line 298 (and again further below) it queries OcrdFile.pageId: https://github.com/OCR-D/core/blob/9069a6581f37ec1c189e8cfaa62692fb66004964/ocrd_models/ocrd_models/ocrd_file.py#L116-L122
This in turn needs to repeatedly query the whole structMap via XPath (which on a workspace with 500 files and 25 fileGrps and 200.000 takes about 0.2sec per file, i.e. needs more than 1h just for the computation of input_files): https://github.com/OCR-D/core/blob/9069a6581f37ec1c189e8cfaa62692fb66004964/ocrd_models/ocrd_models/ocrd_mets.py#L434-L441

A little cosmetics like turning OcrdFile.pageId into a functools.cached_property won’t help here, the problem is bigger. METS with its mutually related fileGrp and pageId mappings is inherently expensive to parse. I know we have in the past decided against in-memory representations like dicts because that looked like memory leaks or seemed too expensive on very large workspaces. But have we really weighed the cost of that memory-cputime tradeoff carefully (and considering the necessity for pageId/mimetype filtering) yet? Is there any existing code attempting to cache fileGrp and pageId mappings to avoid reparsing the METS again and again, which I could tamper with?

Issue Analytics

State:
Created 2 years ago
Comments:9 (3 by maintainers)

Top GitHub Comments

1reaction

mweidlingcommented, May 2, 2022

Just to add another use-case to the scenario: even a simple OcrdMets.add_file can become inefficient on large workspaces. (Becoming as slow as 1 op/sec.) The reason is that it looks for existing files of the same ID first:

https://github.com/OCR-D/core/blob/ad32b00bf692c71a43dacac805e324625bbc447a/ocrd_models/ocrd_models/ocrd_mets.py#L304

So the proposed caching should also happen here.

Thanks for pointing that out, @bertsky !

1reaction

MehmedGITcommented, May 1, 2022

@bertsky I have pushed my latest changes to the benchmarking branch. I have not been working on that experiment after that. @mweidling is investigating this topic in more depth and I am available for discussions and support if needed. My personal opinion is that we should try to optimize the OcrdMets functionalities as soon as possible.