Matching PAGE imageFilename to mets:file when imageFilename is not a URL

Scenario:

  1. Image files and PAGE referencing those image files by relative filepath:

    <Page imageFilename="foo.tif"/>
    
  2. Create a METS file and run workspace add:

    <mets:file GROUPID="page0001" xlink:href="file://path/to/bla/foo.tif"/>
    

Now the PAGE imageFilename and the xlink:href of the corresponding mets:file no longer match.

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 34 (19 by maintainers)

Top GitHub Comments

kba commented, Sep 5, 2019

Revisiting this with @tboenig:

  • imageFilename in PAGE must always be a file path relative to that PAGE file, otherwise tools like Aletheia or PAGEViewer won’t work
  • mets:FLocat is ideally a relative path from the mets.xml

So we need logic that determines the relative path from mets.xml to the image by resolving the imageFilename of a PAGE file against the relative path from mets.xml to that PAGE file.

  • mets.xml references the PAGE file as OCR-D-PAGE/foo.xml
  • OCR-D-PAGE/foo.xml references the image as …/OCR-D-IMG/foo.tif
  • => OCR-D-IMG/foo.tif is the mets:FLocat of that image in mets.xml
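
A minimal sketch of that resolution step, using only Python's standard `posixpath` module rather than any OCR-D API. The function name `image_path_relative_to_mets` is hypothetical, and the usage line assumes the PAGE file refers to its image one directory up (`../OCR-D-IMG/foo.tif`):

```python
import posixpath


def image_path_relative_to_mets(page_path_in_mets: str, image_filename: str) -> str:
    """Resolve a PAGE imageFilename against the location of the PAGE file.

    page_path_in_mets: path of the PAGE file relative to the mets.xml directory.
    image_filename:    value of Page/@imageFilename, relative to the PAGE file.
    Returns the image path relative to the mets.xml directory.
    (Hypothetical helper for illustration, not part of ocrd.)
    """
    page_dir = posixpath.dirname(page_path_in_mets)
    resolved = posixpath.normpath(posixpath.join(page_dir, image_filename))
    # Refuse results that are absolute or that escape the workspace directory.
    if posixpath.isabs(resolved) or resolved == ".." or resolved.startswith("../"):
        raise ValueError(f"{image_filename!r} does not resolve inside the workspace")
    return resolved


flocat = image_path_relative_to_mets("OCR-D-PAGE/foo.xml", "../OCR-D-IMG/foo.tif")
print(flocat)
```

Raising on paths that leave the workspace is a design choice of this sketch; the thread itself does not say how such cases should be handled.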

bertsky commented, Jan 16, 2020

> Moving images and PAGE to the workspace will require changing the input PAGE. Not really a question, just a statement.

Yes, that’s crucial. If we take this seriously, ocrd workspace add on PAGE-XML files will either take control of that file or make a copy of it (under the “right” path).

> Also do this for AlternativeImage? Does anyone besides us even use them? I suppose yes and no.

I guess we have to consider the possibility. If we solve this conceptually for Page/@imageFilename, it should work the same for AlternativeImage/@filename though.

> How to determine file metadata for the `imageFilename`? The media type can be guessed, but which `mets:fileGrp` should the images be added to? Maybe the file group used as the input plus the suffix `-IMG`?

IIUC you assume here that ocrd workspace add will be responsible for adding the image file along with the PAGE-XML file passed to it. We could have other provisions (like assuming the image file must already have been added by then), but let’s follow this logic for now:

Yes, the image could be placed under a fileGrp implicitly derived from the fileGrp for the PAGE-XML, or even the same fileGrp (just with a different MIME type and not appearing in the structMap).
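
The two guesses discussed here (media type from the file name, image file group derived from the PAGE file group) could look roughly like this. This is only a sketch: `derive_image_metadata` and the returned keys are hypothetical names, the `-IMG` suffix is the convention floated in the quoted question, and the media type comes from Python's standard `mimetypes` module:

```python
import mimetypes


def derive_image_metadata(image_filename: str, page_file_grp: str, suffix: str = "-IMG") -> dict:
    """Guess METS metadata for an image referenced from a PAGE file.

    Hypothetical helper: the file group is the PAGE file group plus a suffix,
    the media type is guessed from the file extension.
    """
    media_type, _encoding = mimetypes.guess_type(image_filename)
    if media_type is None or not media_type.startswith("image/"):
        # Fail rather than register a file with an unknown or non-image type.
        raise ValueError(f"cannot guess an image media type for {image_filename!r}")
    return {"file_grp": page_file_grp + suffix, "media_type": media_type}


print(derive_image_metadata("foo.tif", "OCR-D-PAGE"))
```

Passing `suffix=""` would model the other variant mentioned in the reply, where the image lands in the same file group as the PAGE-XML.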

> Let’s make it toggleable with a --include-page-images/--no-include-page-images or similar flag.

If we add an option, why not just the name of the image file group (or none for “ignore images”)?
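
One way to picture that alternative is a single option that carries the image file group name, with its absence meaning "ignore images". The sketch below uses Python's `argparse` purely for illustration; the option name `--page-image-grp` and the program name are hypothetical and are not existing `ocrd workspace add` flags:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a toy parser with one optional image file group option."""
    parser = argparse.ArgumentParser(prog="add-page-sketch")
    parser.add_argument("page_file", help="PAGE-XML file to add")
    parser.add_argument(
        "--page-image-grp",
        metavar="FILEGRP",
        default=None,  # None means: do not add the referenced images
        help="file group to add the images referenced by the PAGE file to",
    )
    return parser


args = build_parser().parse_args(["--page-image-grp", "OCR-D-IMG", "OCR-D-PAGE/foo.xml"])
add_images = args.page_image_grp is not None
print(args.page_image_grp, add_images)
```

Compared with a boolean toggle, this folds the "whether" and the "where" into one value.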

> Any issues that arise from the necessary conventions for this are the user's responsibility, i.e. if they want to set a different name or a different media type for an image, they either need to post-process the XML themselves or not use this feature and add the images themselves as before.

Right. And let’s think about the second use-case (adding PAGE-XML after the image) more thoroughly: Now ocrd workspace add can go looking for the (basename of the) filename in the (image) FLocat URLs of the METS, and calculate the new relative path for the PAGE-XML under its destination directory. If it does not find an image with that filename, it can still go looking for an image with the same pageId. Only if that fails as well does it have to fail loudly.

Personally, I think this is a more sensible interface than add-image-via-PAGE.
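
The lookup order described for this second use-case can be sketched as follows. Everything here is an assumption made for illustration: `MetsImage`, `find_image` and `new_image_filename` are hypothetical names, the METS is modelled as a plain list instead of going through the ocrd API, and taking the first match when several images share a basename is a simplification:

```python
import posixpath
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MetsImage:
    """Hypothetical stand-in for an image entry already present in the METS."""
    url: str      # FLocat URL, relative to the mets.xml directory
    page_id: str  # physical page the image belongs to


def find_image(images: List[MetsImage], image_filename: str,
               page_id: Optional[str] = None) -> MetsImage:
    """Find the image by basename first, then by pageId, else fail loudly."""
    wanted = posixpath.basename(image_filename)
    for img in images:
        if posixpath.basename(img.url) == wanted:
            return img
    if page_id is not None:
        for img in images:
            if img.page_id == page_id:
                return img
    raise LookupError(f"no image for {image_filename!r} (pageId {page_id!r}) in METS")


def new_image_filename(image_url: str, page_dest_dir: str) -> str:
    """Path of the image relative to the directory the PAGE file will live in."""
    return posixpath.relpath(image_url, start=page_dest_dir)


images = [MetsImage(url="OCR-D-IMG/foo.tif", page_id="page0001")]
match = find_image(images, "foo.tif", page_id="page0001")
print(new_image_filename(match.url, "OCR-D-PAGE"))
```

The value returned by `new_image_filename` is what would be written back into the PAGE file's imageFilename once the file sits under its destination directory.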

> Let’s default NOT to do this because it really only makes sense when importing data, not, e.g., every time a bashlib processor wants to add an image.

This got me confused: I thought we were talking about adding PAGE-XML files here?
