RFC: ocrd-sanitize script to preprocess/postprocess OCR-D workspaces
METS/PAGE/ALTO provided by digitization workflow software or repositories will not always adhere to the conventions we have in OCR-D. OTOH, the workspaces that result from OCR-D workflows contain a lot of redundant information that is not relevant for ingestion into production systems or that contradicts the local conventions of the production system.
Also, our conventions have been shifting and will continue to do so to meet the needs of users and developers.
Many users therefore have developed scripts to preprocess input and postprocess output of OCR-D.
OCR-D/core should provide a processor ocrd-sanitize which is only concerned with “housekeeping” of workspaces. Possible actions include:
- Pruning of mets:fileGrp, either by allowlist or denylist, i.e. removing each mets:fileGrp that is not required anymore, together with the mets:file entries it contains (and the files on disk)
- Regex-based replacement of all xlink:href to match local conventions
- Removing all but the lowest level of page:TextEquiv information in PAGE-XML
- Approximating polygons with bounding boxes in PAGE-XML to support full-text indexing
- Upgrading older PAGE-XML namespaces to the latest version (#503)
- Assigning persistent identifiers to work, pages, files …
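To make some of these actions more concrete, here is a minimal sketch of three of them (fileGrp pruning, href rewriting, polygon-to-bounding-box) using only the Python standard library. The function names `prune_file_groups`, `rewrite_hrefs` and `polygon_to_bbox` are hypothetical and not part of OCR-D/core. The sketch assumes unnested mets:fileGrp elements and integer PAGE coordinates, and it leaves deleting files on disk to the caller.

```python
import re
import xml.etree.ElementTree as ET

METS = "{http://www.loc.gov/METS/}"
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def prune_file_groups(root, keep=None, drop=None):
    """Remove top-level mets:fileGrp by allowlist (keep) or denylist (drop).

    Returns the hrefs of all removed files so the caller can delete them
    from disk. Also removes mets:fptr elements that pointed to removed files.
    """
    removed_hrefs, removed_ids = [], set()
    for file_sec in root.iter(METS + "fileSec"):
        for grp in list(file_sec.findall(METS + "fileGrp")):
            use = grp.get("USE")
            unwanted = (keep is not None and use not in keep) or (
                drop is not None and use in drop
            )
            if not unwanted:
                continue
            for mets_file in grp.iter(METS + "file"):
                removed_ids.add(mets_file.get("ID"))
                for flocat in mets_file.iter(METS + "FLocat"):
                    removed_hrefs.append(flocat.get(XLINK_HREF))
            file_sec.remove(grp)
    # ElementTree has no parent pointers, so scan every element's children
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == METS + "fptr" and child.get("FILEID") in removed_ids:
                parent.remove(child)
    return removed_hrefs


def rewrite_hrefs(root, pattern, replacement):
    """Apply a regex substitution to every xlink:href; return the number changed."""
    changed = 0
    for element in root.iter():
        old = element.get(XLINK_HREF)
        if old is None:
            continue
        new = re.sub(pattern, replacement, old)
        if new != old:
            element.set(XLINK_HREF, new)
            changed += 1
    return changed


def polygon_to_bbox(points):
    """Turn a PAGE 'points' string ("x,y x,y ...") into its bounding rectangle."""
    coords = [tuple(int(v) for v in pair.split(",")) for pair in points.split()]
    xs, ys = zip(*coords)
    x0, y0, x1, y1 = min(xs), min(ys), max(xs), max(ys)
    return f"{x0},{y0} {x1},{y0} {x1},{y1} {x0},{y1}"
```

A real processor would presumably operate on the workspace through OCR-D/core rather than on the raw XML tree; the point here is only to show how small each housekeeping step is in isolation.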
These are just some ideas, we’d love to hear yours. Please share your pre-processing/post-processing scripts or feature requests for such a tool so we can develop a solution together for common tasks.
Top GitHub Comments
My largest demand for a sanitizer would be ensuring ingest into Kitodo.Presentation / DFG-Viewer works.
According to this we are already close, but…
- /alto/Layout/Page/@WIDTH is extremely important, because Kitodo.Presentation needs to add the DFG footer (which comes in multiples of 1000px width IIUC) and therefore scales the images, and thus needs to know by what amount to scale the ALTO coordinates accordingly
- DEFAULT fileGrp (whether by alias to another, existing fileGrp or by renaming I am not sure)
- FULLTEXT fileGrp (not sure what to do if multiple versions are available) and MIMETYPE="text/html" (not application/alto+xml!)
- LOCTYPE="URL" (but not sure about the kind of response the webserver needs to give, esp. whether it must understand and convey the correct Content-Type MIME or may omit it or use some nonsense like application/octet-stream)
- for each mets:file there must be exactly one FLocat (which was already discussed within the remote-local bookkeeping and partial manifestation idea)
- structMap of TYPE="PHYSICAL" with a mets:div of TYPE="physSequence" in it and at least one mets:div in that with TYPE="page" (i.e. at least one page) and an ORDER label
- structMap of TYPE="LOGICAL" with a mets:div of some TYPE in it (“the name is not important”) and at least one mets:div in that with TYPE among these labels
- structLink linking each physical page to at least one logical element
- mets:dmdSec with at least some MODS or TEIHDR metadata
- mets:amdSec with at least some mets:techMD or external namespace metadata and some mets:rightsMD (with various dv:rights specs) and mets:digiprovMD (with dv:reference)

I stand corrected: As this example by @stefanCCS – METS and ALTO – shows, MIMETYPE="application/alto+xml" and ALTO v4.1 do work actually. (That is, newer features are simply ignored.)
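The structural METS requirements named in this comment could be turned into a simple pre-ingest check. The following is a rough sketch under stated assumptions: `check_mets_for_presentation` is a hypothetical name, the checks only cover what is described above (not the full DFG-Viewer profile), and the MIMETYPE, the list of allowed logical TYPE labels and the contents of mets:dmdSec/mets:amdSec are not validated.

```python
import xml.etree.ElementTree as ET

METS = "{http://www.loc.gov/METS/}"
XLINK_TO = "{http://www.w3.org/1999/xlink}to"
NS = {"mets": "http://www.loc.gov/METS/"}


def check_mets_for_presentation(root):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []

    # DEFAULT and FULLTEXT file groups must exist
    uses = {grp.get("USE") for grp in root.findall("mets:fileSec/mets:fileGrp", NS)}
    for required in ("DEFAULT", "FULLTEXT"):
        if required not in uses:
            problems.append(f"missing fileGrp USE={required}")

    # exactly one FLocat per mets:file
    for mets_file in root.iter(METS + "file"):
        count = len(mets_file.findall(METS + "FLocat"))
        if count != 1:
            problems.append(f"mets:file {mets_file.get('ID')} has {count} FLocat")

    # physical structMap: physSequence with at least one page, each with ORDER
    pages = root.findall(
        "mets:structMap[@TYPE='PHYSICAL']"
        "/mets:div[@TYPE='physSequence']/mets:div[@TYPE='page']",
        NS,
    )
    if not pages:
        problems.append("no page in physical structMap")
    for page in pages:
        if page.get("ORDER") is None:
            problems.append(f"page {page.get('ID')} has no ORDER")

    # logical structMap with at least one typed mets:div
    logical = root.find("mets:structMap[@TYPE='LOGICAL']", NS)
    if logical is None or not any(d.get("TYPE") for d in logical.iter(METS + "div")):
        problems.append("no typed mets:div in logical structMap")

    # every physical page must be the target of at least one smLink
    linked = {l.get(XLINK_TO) for l in root.findall("mets:structLink/mets:smLink", NS)}
    for page in pages:
        if page.get("ID") not in linked:
            problems.append(f"page {page.get('ID')} is not linked to a logical element")

    # descriptive and administrative metadata sections must be present
    for section in ("dmdSec", "amdSec"):
        if root.find(f"mets:{section}", NS) is None:
            problems.append(f"missing mets:{section}")
    return problems
```

A sanitizer could run such a check after its rewriting steps and report the remaining problems instead of silently producing a METS file that the viewer rejects.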