regression: ocrd_mets.remove_physical_page broken

See https://github.com/hnesk/browse-ocrd/actions/runs/3573770934/jobs/6008222016

It seems that the (new?) implementation is broken:

  File "/home/runner/work/browse-ocrd/browse-ocrd/ocrd_browser/model/document.py", line 383, in delete_page
    self.workspace.mets.remove_physical_page(page_id)
  File "/opt/hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/ocrd_models/ocrd_mets.py", line 689, in remove_physical_page
    mets_div[0].getparent().remove(mets_div[0])
IndexError: list index out of range
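
The traceback shows a lookup result being indexed with `[0]` while the list is empty. As a minimal sketch of that failure mode and one possible guard (this is not the ocrd_models implementation; it uses the standard library's ElementTree instead of lxml, and the helper name `remove_page_div` is hypothetical):

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}

def remove_page_div(root, page_id):
    """Remove the page div with the given ID; return True if one was removed.

    Hypothetical helper for illustration only, not part of OcrdMets.
    """
    phys_seq = root.find(".//mets:div[@TYPE='physSequence']", NS)
    if phys_seq is None:
        return False
    matches = [div for div in phys_seq.findall("mets:div[@TYPE='page']", NS)
               if div.get("ID") == page_id]
    # Indexing matches[0] without this check raises IndexError
    # whenever the lookup comes back empty.
    if not matches:
        return False
    phys_seq.remove(matches[0])
    return True

root = ET.fromstring(
    '<mets:mets xmlns:mets="http://www.loc.gov/METS/">'
    '<mets:structMap TYPE="PHYSICAL">'
    '<mets:div TYPE="physSequence">'
    '<mets:div TYPE="page" ID="PHYS_0001"/>'
    '<mets:div TYPE="page" ID="PHYS_0002"/>'
    '</mets:div></mets:structMap></mets:mets>'
)

removed_existing = remove_page_div(root, "PHYS_0001")
removed_missing = remove_page_div(root, "PHYS_9999")
remaining = [div.get("ID") for div in root.iter("{http://www.loc.gov/METS/}div")
             if div.get("TYPE") == "page"]
```

Whether a missing page should be silently ignored or reported as an error is a design choice; the sketch only shows that the empty case has to be handled somewhere.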

Unfortunately, I cannot pinpoint / dissect, because apparently, @MehmedGIT has effectively erased the history of ocrd_mets.py.

Issue Analytics

  • State: open
  • Created 10 months ago
  • Comments: 7 (2 by maintainers)

Top GitHub Comments

2 reactions
kba commented, Nov 29, 2022

I could fix the problems with OCRD_METS_CACHING and browse-ocrd’s test suite except the test_reorder test. The problem there is that browse-ocrd modifies the underlying XML but the OcrdMets caching does not know about it. We either need to extend the OcrdMets API to offer the functionality (i.e. reordering of pages) or at least a way to let OcrdMets know that it should invalidate and re-fill the cache. @MehmedGIT @hnesk

1 reaction
bertsky commented, Nov 30, 2022

However, after checking the source code of reorder(), I am surprised that it breaks the cache. Even when altering the element tree directly. The content of the separate mets:div[@TYPE="page"] elements seems to not change (or does it?), just their order inside the mets:div[@TYPE="physSequence"] element. This must not break the cache since the cache does not depend on order, but on content.

Yes, but look again: it does not in fact break the cache validity at all. And that’s the problem really: What is actually tested as a result, self.page_ids, delegates to OcrdMets.physical_pages, and thus, your cache. So while self.reorder does reorder the pages in the actual element tree, the cached version still contains the old order (because it has not been invalidated yet).
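
To illustrate the stale-order effect described here, a toy sketch follows. It assumes a simplified cache that stores a list of page IDs; the class `CachedPages`, its `invalidate_cache` method and the function `reorder_in_tree` are hypothetical stand-ins, not the OcrdMets or browse-ocrd API:

```python
import xml.etree.ElementTree as ET

class CachedPages:
    """Toy wrapper that caches page IDs on first access (not OcrdMets)."""

    def __init__(self, phys_seq):
        self.phys_seq = phys_seq
        self._page_ids = None  # filled lazily

    @property
    def page_ids(self):
        if self._page_ids is None:
            self._page_ids = [div.get("ID") for div in self.phys_seq]
        return self._page_ids

    def invalidate_cache(self):
        # Hypothetical hook: forget the cached IDs so they are re-read.
        self._page_ids = None

def reorder_in_tree(phys_seq, new_order):
    """Reorder the page divs directly in the tree, bypassing the wrapper."""
    by_id = {div.get("ID"): div for div in phys_seq}
    for div in list(phys_seq):
        phys_seq.remove(div)
    for page_id in new_order:
        phys_seq.append(by_id[page_id])

phys_seq = ET.fromstring(
    '<div TYPE="physSequence">'
    '<div TYPE="page" ID="P1"/><div TYPE="page" ID="P2"/><div TYPE="page" ID="P3"/>'
    '</div>'
)
pages = CachedPages(phys_seq)

before = list(pages.page_ids)          # fills the cache
reorder_in_tree(phys_seq, ["P3", "P1", "P2"])
stale = list(pages.page_ids)           # still the old order
pages.invalidate_cache()
fresh = list(pages.page_ids)           # re-read from the tree
```

In this sketch the cache itself stays internally valid; it simply no longer matches the tree until something tells it to refill.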

(For example during error recovery, if a processor crashed on one page, but may have already gotten to update the METS before it did.)

I am also interested in this topic. By METS do you mean the XML element tree (eTree) in the memory or the mets file on the disk? I assume it’s the latter. However, I cannot see how the cache invalidation could be useful in that scenario.

No, I did mean the cached state in memory. Let’s say we finally get to have some form of error recovery (e.g. catching anything below Processor.process_page). Now if the processor crashes on one page, but already made its METS action prior to that, and then recovery has the program revert to some dummy or copy behaviour and continue with the next page – clearly, it should also invalidate the cache, so whatever is now in the tree is also in the cache.

Consider these steps and make corrections if needed:

  1. The mets file on the disk gets loaded in the memory (eTree)
  2. The 3 cache dictionaries are filled by iterating the relevant parts of the eTree
  3. An ocr-d processor indirectly modifies the content of the eTree by calling some OcrdMets method
  4. The method called in step 3 modifies the eTree
  5. The method called in step 3 modifies the cache instance/s
  6. The processor calls Workspace.save_mets() to store the eTree on the disk (not sure if that happens here or how often it does? - ideally, it should not)

Depends. In the processing server, we should be able to have the METS only in memory throughout the whole workflow. But in a standard CLI run, as soon as a processor is finished, it needs to serialise to disk. With page-wise processing it obviously depends on just how that is implemented: With the current METS splitting, we are in the latter case, whereas with page-parallel API we are in memory-only territory.

  7. Steps 3-5 (6?) are repeated a few times until the current page is processed
  8. The processor finishes processing the current page / The processor fails to process the current page
  9. The processor calls Workspace.save_mets() to store the eTree on the disk
  10. Steps 3-9 (9?) are repeated many times until the ocr-d processor finishes processing all pages

So, AFAIS:

  1. Inconsistency between the XML element tree and the cache could happen only if the processor crashes after completing step 4 and before completing step 5. However, this does/should not affect the mets file on the disk.

Yes. But see above (“continue with next page…”)

  2. If the content of the mets file on the disk changes in step 6 and the processor fails to process the current page in step 8, then we end up with a mets file that is broken/incomplete since the content of the eTree and the cache is broken/incomplete.

Definitely. But error handling could always try to undo the last METS action as part of recovery. (So error handling would involve making “backups” available for rollback, as would a MetsServer naturally.)

  3. If there is no step 6, even if the processor fails to process a page the content of the mets file on the disk will be in a good state and without leftovers from a failed page. The content of the mets file is loaded in the memory and the cache gets filled again.
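
The recovery idea discussed above (keep a backup of the in-memory tree, roll back when a page fails, then invalidate the cache) could look roughly like the following sketch. Everything here is an assumption for illustration: `InMemoryMets`, `run_with_rollback` and `flaky_process_page` are hypothetical names, and no such error recovery is confirmed to exist in OCR-D core.

```python
import copy
import xml.etree.ElementTree as ET

class InMemoryMets:
    """Toy in-memory METS with a cache of file IDs (not OcrdMets)."""

    def __init__(self, root):
        self.root = root
        self.cache = None

    def file_ids(self):
        if self.cache is None:
            self.cache = [f.get("ID") for f in self.root.iter("file")]
        return self.cache

    def add_file(self, file_id):
        # Modify the tree first, then the cache (steps 4 and 5 in the list above).
        ET.SubElement(self.root, "file", ID=file_id)
        if self.cache is not None:
            self.cache.append(file_id)

def run_with_rollback(mets, page_ids, process_page):
    """Process pages one by one; undo the METS changes of any page that fails."""
    failed = []
    for page_id in page_ids:
        backup = copy.deepcopy(mets.root)  # "backup" available for rollback
        try:
            process_page(mets, page_id)
        except Exception:  # broad on purpose: this is only a sketch
            mets.root = backup   # roll the tree back
            mets.cache = None    # invalidate, so the cache is refilled from the tree
            failed.append(page_id)
    return failed

def flaky_process_page(mets, page_id):
    mets.add_file("OUT_" + page_id)
    if page_id == "P2":
        raise RuntimeError("simulated crash after the METS update")

mets = InMemoryMets(ET.Element("mets"))
mets.file_ids()  # fill the cache before processing starts
failed = run_with_rollback(mets, ["P1", "P2", "P3"], flaky_process_page)
result = mets.file_ids()
```

Copying the whole tree per page is expensive for a large METS; the sketch favours clarity over efficiency.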