performance of input_files on large workspaces

By implementing #635 to properly handle all cases of PAGE-XML file matching per pageId, we have lost sight of the severe performance penalty that this comes with. In effect, we are now nearly as slow as before #482 on workspaces with lots of pages and fileGrps.

Here’s a typical scenario:

  1. ocrd-cis-ocropy-dewarp creates an image file for each text line and references it under the pageId and fileGrp it belongs to, with an image mimetype (this creates 29,000 files for me in a workspace with 500 pages)
  2. ocrd-tesserocr-recognize runs afterwards and queries its self.input_files: https://github.com/OCR-D/core/blob/9069a6581f37ec1c189e8cfaa62692fb66004964/ocrd/ocrd/processor/base.py#L294-L299
  3. this searches through all mets:file entries, matching them by fileGrp (which is reasonably fast; it only gets a little inefficient when additionally filtering by pageId): https://github.com/OCR-D/core/blob/9069a6581f37ec1c189e8cfaa62692fb66004964/ocrd_models/ocrd_models/ocrd_mets.py#L176-L208
  4. Then in line 298 (and again further below) it queries OcrdFile.pageId: https://github.com/OCR-D/core/blob/9069a6581f37ec1c189e8cfaa62692fb66004964/ocrd_models/ocrd_models/ocrd_file.py#L116-L122
  5. This in turn needs to repeatedly query the whole structMap via XPath (which on a workspace with 500 files and 25 fileGrps and 200,000 takes about 0.2 sec per file, i.e. needs more than 1h just for the computation of input_files): https://github.com/OCR-D/core/blob/9069a6581f37ec1c189e8cfaa62692fb66004964/ocrd_models/ocrd_models/ocrd_mets.py#L434-L441
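
To make the cost pattern in steps 4 and 5 concrete, here is a minimal sketch that reproduces it on a toy METS tree. It uses the standard library's xml.etree.ElementTree rather than OCR-D's own classes, and build_toy_mets, page_id_by_scan and page_ids_naive are hypothetical names, not functions from OCR-D/core. If the 29,000 line images from step 1 are what gets iterated (an assumption), then 29,000 × 0.2 sec ≈ 5,800 sec, about 1.6 h, which would fit the "more than 1h" figure.

```python
import time
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}
METS = "{http://www.loc.gov/METS/}"


def build_toy_mets(n_pages, files_per_page):
    """Build a toy METS tree: one fileGrp, n_pages pages, files_per_page files each."""
    root = ET.Element(METS + "mets")
    file_sec = ET.SubElement(root, METS + "fileSec")
    grp = ET.SubElement(file_sec, METS + "fileGrp", USE="OCR-D-IMG-LINES")
    struct_map = ET.SubElement(root, METS + "structMap", TYPE="PHYSICAL")
    seq = ET.SubElement(struct_map, METS + "div", TYPE="physSequence")
    for p in range(n_pages):
        page = ET.SubElement(seq, METS + "div", TYPE="page", ID=f"PHYS_{p:04d}")
        for f in range(files_per_page):
            file_id = f"LINE_{p:04d}_{f:03d}"
            ET.SubElement(grp, METS + "file", ID=file_id, MIMETYPE="image/png")
            ET.SubElement(page, METS + "fptr", FILEID=file_id)
    return root


def page_id_by_scan(root, file_id):
    """Find the page of one file by walking the structMap (one traversal per call)."""
    for page in root.iterfind(".//mets:div[@TYPE='page']", NS):
        if page.find(f"mets:fptr[@FILEID='{file_id}']", NS) is not None:
            return page.get("ID")
    return None


def page_ids_naive(root):
    """Resolve the page of every file separately, as a per-file property would."""
    files = root.findall(".//mets:fileGrp/mets:file", NS)
    return {f.get("ID"): page_id_by_scan(root, f.get("ID")) for f in files}


if __name__ == "__main__":
    # Doubling the number of pages doubles both the number of files and the
    # work per lookup, so the total time grows roughly quadratically.
    for n_pages in (25, 50):
        root = build_toy_mets(n_pages, files_per_page=20)
        start = time.perf_counter()
        page_ids_naive(root)
        print(n_pages, "pages:", round(time.perf_counter() - start, 3), "sec")
```

The absolute timings of this toy are not comparable to a real workspace; the point is only that each pageId lookup touches the whole structMap, so the total cost scales with files × pages.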

Small cosmetic changes such as turning OcrdFile.pageId into a functools.cached_property won’t help here; the problem is bigger. METS with its mutually related fileGrp and pageId mappings is inherently expensive to parse. I know we have decided against in-memory representations like dicts in the past, because that looked like memory leaks or seemed too expensive on very large workspaces. But have we really weighed that memory vs. CPU time tradeoff carefully yet, also considering the necessity of pageId/mimetype filtering? Is there any existing code attempting to cache fileGrp and pageId mappings to avoid reparsing the METS again and again, which I could tamper with?
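
As a rough illustration of what such a cache could look like, the following sketch builds the fileGrp and pageId mappings in a single pass and then answers filtered queries from dicts. It is not existing OCR-D code: MetsIndex, its attributes and its find method are hypothetical names, the METS snippet is a made-up toy, and the standard library's xml.etree.ElementTree stands in for whatever parser the real implementation uses.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

NS = {"mets": "http://www.loc.gov/METS/"}

TOY_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:fileSec>
    <mets:fileGrp USE="OCR-D-IMG">
      <mets:file ID="IMG_0001" MIMETYPE="image/tiff"/>
      <mets:file ID="IMG_0002" MIMETYPE="image/tiff"/>
    </mets:fileGrp>
    <mets:fileGrp USE="OCR-D-SEG">
      <mets:file ID="SEG_0001" MIMETYPE="application/vnd.prima.page+xml"/>
      <mets:file ID="SEG_0001_LINE_1" MIMETYPE="image/png"/>
      <mets:file ID="SEG_0002" MIMETYPE="application/vnd.prima.page+xml"/>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence">
      <mets:div TYPE="page" ID="PHYS_0001">
        <mets:fptr FILEID="IMG_0001"/>
        <mets:fptr FILEID="SEG_0001"/>
        <mets:fptr FILEID="SEG_0001_LINE_1"/>
      </mets:div>
      <mets:div TYPE="page" ID="PHYS_0002">
        <mets:fptr FILEID="IMG_0002"/>
        <mets:fptr FILEID="SEG_0002"/>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>"""


class MetsIndex:
    """Hypothetical in-memory index over a parsed METS tree."""

    def __init__(self, root):
        self.page_of = {}                      # file ID -> page ID
        self.files_in_grp = defaultdict(list)  # fileGrp USE -> [(file ID, mimetype)]
        # One pass over the structMap instead of one XPath query per file.
        for page in root.iterfind(".//mets:div[@TYPE='page']", NS):
            for fptr in page.iterfind("mets:fptr", NS):
                self.page_of[fptr.get("FILEID")] = page.get("ID")
        # One pass over the fileSec.
        for grp in root.iterfind(".//mets:fileGrp", NS):
            for f in grp.iterfind("mets:file", NS):
                self.files_in_grp[grp.get("USE")].append(
                    (f.get("ID"), f.get("MIMETYPE")))

    def find(self, file_grp, page_id=None, mimetype=None):
        """Return file IDs in file_grp, optionally filtered by page and mimetype."""
        return [
            fid for fid, mime in self.files_in_grp.get(file_grp, [])
            if (page_id is None or self.page_of.get(fid) == page_id)
            and (mimetype is None or mime == mimetype)
        ]


index = MetsIndex(ET.fromstring(TOY_METS))
page_files = index.find("OCR-D-SEG", page_id="PHYS_0001",
                        mimetype="application/vnd.prima.page+xml")
```

The memory cost in this sketch is one dict entry and one tuple per file, which is the tradeoff in question. Any such index would also have to be kept in sync whenever files are added to or removed from the METS.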

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 9 (3 by maintainers)

Top GitHub Comments

mweidling commented, May 2, 2022

Just to add another use case to the scenario: even a simple OcrdMets.add_file can become inefficient on large workspaces (as slow as 1 op/sec). The reason is that it looks for existing files with the same ID first:

https://github.com/OCR-D/core/blob/ad32b00bf692c71a43dacac805e324625bbc447a/ocrd_models/ocrd_models/ocrd_mets.py#L304

So the proposed caching should also happen here.
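
For the add_file case, the idea can be sketched as a dict keyed by file ID that is updated alongside the XML tree, so that the duplicate check no longer searches the tree. This is only an illustration under assumptions: CachedFileSec and its add_file signature are hypothetical and much simpler than OcrdMets.add_file, and the standard library's xml.etree.ElementTree is used in place of the real METS model.

```python
import xml.etree.ElementTree as ET

METS = "{http://www.loc.gov/METS/}"


class CachedFileSec:
    """Hypothetical fileSec wrapper that keeps an ID -> element dict in sync."""

    def __init__(self):
        self.file_sec = ET.Element(METS + "fileSec")
        self._grp_by_use = {}   # fileGrp USE -> mets:fileGrp element
        self._file_by_id = {}   # file ID -> mets:file element

    def add_file(self, file_grp, file_id, mimetype):
        # Dict lookup instead of searching all existing mets:file entries.
        if file_id in self._file_by_id:
            raise ValueError(f"file with ID {file_id!r} already exists")
        grp = self._grp_by_use.get(file_grp)
        if grp is None:
            grp = ET.SubElement(self.file_sec, METS + "fileGrp", USE=file_grp)
            self._grp_by_use[file_grp] = grp
        el = ET.SubElement(grp, METS + "file", ID=file_id, MIMETYPE=mimetype)
        self._file_by_id[file_id] = el
        return el


file_sec = CachedFileSec()
for i in range(1000):
    file_sec.add_file("OCR-D-IMG-LINES", f"LINE_{i:05d}", "image/png")
```

With the dict, the cost of each add stays roughly constant as the workspace grows. The dict has to be updated on every removal or rename as well, otherwise it would report stale IDs.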

Thanks for pointing that out, @bertsky !

MehmedGIT commented, May 1, 2022

@bertsky I have pushed my latest changes to the benchmarking branch. I have not worked on that experiment since. @mweidling is investigating this topic in more depth, and I am available for discussions and support if needed. My personal opinion is that we should try to optimize the OcrdMets functionalities as soon as possible.
