RFC: ocrd-sanitize script to preprocess/postprocess OCR-D workspaces

METS/PAGE/ALTO provided by digitization workflow software or repositories will not always adhere to the conventions we have in OCR-D. OTOH the workspaces that result from OCR-D workflows contain a lot of redundant information that is not relevant for ingestion into production systems or that contradicts the local conventions of the production system.

Also, our conventions have been shifting and will continue to do so to meet the needs of users and developers.

Many users therefore have developed scripts to preprocess input and postprocess output of OCR-D.

OCR-D/core should provide a processor ocrd-sanitize which is only concerned with “housekeeping” of workspaces. Possible actions include:

  • Pruning of mets:fileGrp, either by allowlist or denylist, i.e. removing each mets:fileGrp that is no longer required, together with the mets:file elements it contains (and the files on disk)
  • regex-based replacement of all xlink:href to match local conventions
  • Removing all but the lowest level of page:TextEquiv information in PAGE-XML
  • Approximating polygons with bounding boxes in PAGE-XML to support full-text-indexing
  • Upgrading older PAGE-XML namespaces to the latest version (#503)
  • Assigning persistent identifiers to work, pages, files …
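
To make the first two actions concrete, here is a minimal sketch using only Python's standard library. It is not an existing OCR-D/core API: the function names `prune_file_groups` and `rewrite_hrefs`, the sample METS and the example.org URL are all hypothetical, and the sketch assumes mets:fileGrp elements are direct children of mets:fileSec (no nesting).

```python
import re
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}
METS = "{http://www.loc.gov/METS/}"
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical two-group workspace, for illustration only.
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                       xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="OCR-D-IMG">
      <mets:file ID="IMG_0001">
        <mets:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="FILE" xlink:href="OCR-D-IMG/0001.tif"/>
      </mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="OCR-D-BIN">
      <mets:file ID="BIN_0001">
        <mets:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="FILE" xlink:href="OCR-D-BIN/0001.png"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence">
      <mets:div TYPE="page" ID="P0001">
        <mets:fptr FILEID="IMG_0001"/>
        <mets:fptr FILEID="BIN_0001"/>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>"""


def prune_file_groups(root, keep=None, drop=None):
    """Remove fileGrps not in `keep` (allowlist) or listed in `drop` (denylist).

    Returns the hrefs of all removed files so the caller can delete them on disk.
    """
    removed_hrefs, removed_ids = [], set()
    for file_sec in root.findall("mets:fileSec", NS):
        for grp in list(file_sec.findall("mets:fileGrp", NS)):
            use = grp.get("USE")
            if (keep is not None and use not in keep) or (drop is not None and use in drop):
                for mets_file in grp.iter(METS + "file"):
                    removed_ids.add(mets_file.get("ID"))
                    for flocat in mets_file.findall("mets:FLocat", NS):
                        removed_hrefs.append(flocat.get(XLINK_HREF))
                file_sec.remove(grp)
    # Drop structMap pointers that would otherwise dangle.
    for parent in list(root.iter()):
        for fptr in list(parent.findall("mets:fptr", NS)):
            if fptr.get("FILEID") in removed_ids:
                parent.remove(fptr)
    return removed_hrefs


def rewrite_hrefs(root, pattern, replacement):
    """Apply re.sub to every xlink:href in the tree; return how many changed."""
    changed = 0
    for element in root.iter():
        old = element.get(XLINK_HREF)
        if old is None:
            continue
        new = re.sub(pattern, replacement, old)
        if new != old:
            element.set(XLINK_HREF, new)
            changed += 1
    return changed


root = ET.fromstring(SAMPLE)
stale_files = prune_file_groups(root, keep={"OCR-D-IMG"})
rewrite_hrefs(root, r"^OCR-D-IMG/", "https://example.org/images/")
```

Returning the removed hrefs rather than deleting files directly keeps the METS edit and the on-disk cleanup separate, so a dry run is possible. A real processor would presumably go through the OCR-D workspace API instead of editing the XML by hand.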

These are just some ideas; we’d love to hear yours. Please share your pre-processing/post-processing scripts or feature requests for such a tool so we can develop a solution together for common tasks.

Issue Analytics

  • State: open
  • Created 3 years ago
  • Comments: 12 (10 by maintainers)

Top GitHub Comments

bertsky commented, Jun 24, 2021 (1 reaction)

My largest demand for a sanitizer would be ensuring ingest into Kitodo.Presentation / DFG-Viewer works.

According to this we are already close, but…

  • our ALTO must be v2.0 currently (see this issue). Unfortunately the DFG-Viewer profile does not say much more, although we already know that SP/newlines are an issue and that /alto/Layout/Page/@WIDTH is extremely important: Kitodo.Presentation needs to add the DFG footer (which comes in multiples of 1000px width IIUC), so it scales the images and needs to know by what amount to scale the ALTO coordinates accordingly
  • that means the XSLT from ocr-filetransform will not in general give the correct results for OCR-D generated PAGE, so we should switch to page-to-alto and recommend/document it
  • our METS itself needs to conform to DFG-Viewer profile, which means that notably
    • images must be in the DEFAULT fileGrp (whether by alias to another, existing fileGrp or by renaming I am not sure)
    • ALTO must be in the FULLTEXT fileGrp (not sure what to do if multiple versions are available) and MIMETYPE="text/html" (not application/alto+xml!)
    • files must be of LOCTYPE="URL" (but not sure about the kind of response the webserver needs to give, esp. whether it must understand and convey the correct Content-Type MIME or may omit it or use some nonsense like application/octet-stream)
    • for every mets:file there must be exactly one FLocat (which was already discussed within the remote-local bookkeeping and partial manifestation idea)
    • there must be a structMap of TYPE="PHYSICAL" with a mets:div of TYPE="physSequence" in it and at least one mets:div in that with TYPE="page" (i.e. at least one page) and an ORDER label
    • there must be a structMap of TYPE="LOGICAL" with a mets:div of some TYPE in it (“the name is not important”) and at least one mets:div in that with TYPE among these labels
    • there must be a structLink linking each physical page to at least one logical element
    • there must be a mets:dmdSec with at least some MODS or TEIHDR metadata
    • there must be a mets:amdSec with at least some mets:techMD or external namespace metadata and some mets:rightsMD (with various dv:rights specs) and mets:digiprovMD (with dv:reference)
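
A sanitizer could start by checking a workspace's METS against the list above before trying to repair anything. The following is a rough sketch of such a check, not a validator for the official DFG-Viewer profile: the function name `check_dfg_viewer_basics` and the message texts are made up for illustration, it covers only the structural points listed in this comment, and it does not look inside the metadata sections.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}
METS = "{http://www.loc.gov/METS/}"

# (ElementTree path, message if nothing matches); derived from the list above.
REQUIRED = [
    ("mets:fileSec/mets:fileGrp[@USE='DEFAULT']", "no fileGrp USE=DEFAULT"),
    ("mets:fileSec/mets:fileGrp[@USE='FULLTEXT']", "no fileGrp USE=FULLTEXT"),
    ("mets:structMap[@TYPE='PHYSICAL']/mets:div[@TYPE='physSequence']"
     "/mets:div[@TYPE='page']", "no page in physical structMap"),
    ("mets:structMap[@TYPE='LOGICAL']/mets:div", "no mets:div in logical structMap"),
    ("mets:structLink", "no structLink"),
    ("mets:dmdSec", "no dmdSec"),
    ("mets:amdSec", "no amdSec"),
]


def check_dfg_viewer_basics(root):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = [msg for path, msg in REQUIRED if root.find(path, NS) is None]

    # Every mets:file needs exactly one FLocat, and it must be LOCTYPE="URL".
    for mets_file in root.iter(METS + "file"):
        file_id = mets_file.get("ID")
        flocats = mets_file.findall("mets:FLocat", NS)
        if len(flocats) != 1:
            problems.append(
                f"file {file_id}: expected exactly one FLocat, found {len(flocats)}"
            )
        for flocat in flocats:
            loctype = flocat.get("LOCTYPE")
            if loctype != "URL":
                problems.append(
                    f"file {file_id}: FLocat LOCTYPE is {loctype!r}, expected 'URL'"
                )

    # Every physical page needs an ORDER attribute.
    page_path = ("mets:structMap[@TYPE='PHYSICAL']/mets:div[@TYPE='physSequence']"
                 "/mets:div[@TYPE='page']")
    for page in root.findall(page_path, NS):
        if page.get("ORDER") is None:
            problems.append(f"page {page.get('ID')}: missing ORDER")

    return problems
```

Calling `check_dfg_viewer_basics(ET.parse("mets.xml").getroot())` on a workspace's METS file (the file name here is only an example) would give a list of findings that a sanitizer could report or act on.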
bertsky commented, Jun 25, 2021 (0 reactions)

I stand corrected: As this example by @stefanCCS (METS and ALTO) shows, MIMETYPE="application/alto+xml" and ALTO v4.1 actually do work. (That is, newer features are simply ignored.)
