move edited/GT TextEquiv to front

Unfortunately, since PAGE-XML completely underspecifies what and how TextEquiv (with or without @index) is used, applications have to define their own convention. IIUC (please correct me if I’m wrong):

LAREX convention

Existing TextEquivs are kept unchanged. An existing TextEquiv with @index=0 is treated as ground truth (GT). Anything else is treated as a prediction, and only the one with the highest position/index is shown.

When manual edits are done, GT is updated or created.

When saving, GT (if available) will become @index=0 and prediction (if available) @index=1. These two will be appended to any existing TextEquiv.

PageViewer

PageViewer only shows the first TextEquiv as a tooltip (regardless of @index).

Aletheia

Aletheia also only shows the first TextEquiv as a tooltip (regardless of @index). It does allow editing multiple TextEquivs, but does not set (or even show) their @index. (It just calls them Variant1, Variant2 and so on in the GUI.)

@tboenig please correct me if this is not true for the fully licensed version.

OCR-D convention

The current spec says that where multiple TextEquivs are available, @index=1 should be preferred.

However, that’s not at all what is currently implemented across OCR-D: processors read the first TextEquiv (regardless of @index) and write starting at @index=0.

(The reason for this behaviour is probably that it is easier to implement and “works” with PageViewer, and that the concrete spec language on the matter came too late. So either we change the spec or we fix the implementation now. @kba?)
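To make the mismatch concrete, here is a minimal sketch of the two reading strategies: taking the first TextEquiv by position versus selecting by @index. It uses only the Python standard library; the helper names (`read_first`, `read_by_index`) and the sample line are made up for illustration and are not taken from OCR-D or any other tool.

```python
import xml.etree.ElementTree as ET

# PAGE 2019 namespace; files using an older schema version need a different URI.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Hypothetical sample line with two TextEquivs, written starting at @index=0.
LINE_XML = """
<TextLine xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15" id="l1">
  <TextEquiv index="0"><Unicode>first result</Unicode></TextEquiv>
  <TextEquiv index="1"><Unicode>second result</Unicode></TextEquiv>
</TextLine>
"""

def text_of(equiv):
    """Return the Unicode content of a TextEquiv, or None if there is none."""
    if equiv is None:
        return None
    return equiv.findtext("pc:Unicode", namespaces=NS)

def read_first(line):
    """Positional reading: first TextEquiv child, ignoring @index."""
    return text_of(line.find("pc:TextEquiv", NS))

def read_by_index(line, index):
    """Attribute-based reading: first TextEquiv whose @index matches."""
    return text_of(line.find(f"pc:TextEquiv[@index='{index}']", NS))

line = ET.fromstring(LINE_XML)
print(read_first(line))        # first result
print(read_by_index(line, 1))  # second result
```

On the same element, the two strategies return different text, which is why the position of the GT TextEquiv matters for any consumer that reads positionally.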

Solution

To become interoperable with OCR-D, it would currently suffice to just insert the new TextEquiv elements in front of the existing ones (while keeping all the @index rules).
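A rough sketch of what "insert in front" could look like, again with the Python standard library only. The function name `insert_equiv_in_front` is hypothetical, and two details are assumptions rather than anything stated in this issue: the new element is placed before the first existing TextEquiv (not before Coords), and it is appended at the end of the line if no TextEquiv exists yet.

```python
import xml.etree.ElementTree as ET

PC = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
NS = {"pc": PC}
EQUIV_TAG = f"{{{PC}}}TextEquiv"

def insert_equiv_in_front(line, text, index):
    """Create a TextEquiv with the given @index and put it before all existing ones."""
    new = ET.Element(EQUIV_TAG, {"index": str(index)})
    ET.SubElement(new, f"{{{PC}}}Unicode").text = text
    children = list(line)
    # Assumption: fall back to appending when the line has no TextEquiv yet.
    pos = next((i for i, c in enumerate(children) if c.tag == EQUIV_TAG), len(children))
    line.insert(pos, new)
    return new

# Hypothetical line with one pre-existing TextEquiv without @index.
line = ET.fromstring(
    f'<TextLine xmlns="{PC}" id="l1">'
    '<Coords points="0,0 10,0 10,10 0,10"/>'
    "<TextEquiv><Unicode>old OCR</Unicode></TextEquiv>"
    "</TextLine>"
)

# Insert the prediction first, then the GT, so that the GT ends up in front.
insert_equiv_in_front(line, "predicted", 1)
insert_equiv_in_front(line, "corrected", 0)

texts = [te.findtext("pc:Unicode", namespaces=NS) for te in line.findall("pc:TextEquiv", NS)]
print(texts)  # ['corrected', 'predicted', 'old OCR']
```

The @index values follow the saving rule described above (GT as 0, prediction as 1); only the position among siblings changes.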

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 5 (1 by maintainers)

Top GitHub Comments

bertsky commented, Oct 4, 2021

… without severely reinterpreting/abusing the ALTO representation. (For example …

(Another possibility, suggested by @stefanCCS privately, would be adding one alto:TextLine/alto:String/alto:Glyph per character of pc:TextLine/pc:TextEquiv/pc:Unicode – with pseudo-coordinates, since we usually do not have word or glyph segmentation available. These Glyphs could then carry Variants naturally. But while Variants of different Glyphs are usually independent of each other, here we would have to give them a special interpretation which prevents mixing/recombining local variants – like first glyph first variant with second glyph second variant. Again, there would always be the danger of being confused with actual Glyphs and actual local variants.)

bertsky commented, Oct 8, 2021

Is the expected behavior that only completely new TextEquiv elements – as in ‘no TextEquiv[@index="0"] element existed prior to adding it’ – get inserted as first child or should this also apply when users edit the content of already existing TextEquiv[@index="0"]?

The former. (I don’t know how LAREX behaves if multiple index0 versions preexist. But whatever index0 it picks up should be the one that OCR-D will see. Therefore, if index0 existed but was not first, LAREX should move it to the fore.)
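The "move it to the fore" case could be sketched as follows. This is not LAREX code: the function name `move_index0_to_front` is hypothetical, and if several TextEquivs with @index=0 preexist, the sketch simply picks the first one in document order, which is an assumption.

```python
import xml.etree.ElementTree as ET

PC = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
NS = {"pc": PC}

def move_index0_to_front(line):
    """Move the TextEquiv with @index=0 before all other TextEquivs.

    Returns True if the element was moved, False if nothing had to change.
    """
    equivs = line.findall("pc:TextEquiv", NS)
    # Assumption: with several @index=0 candidates, take the first in document order.
    gt = next((te for te in equivs if te.get("index") == "0"), None)
    if gt is None or gt is equivs[0]:
        return False
    first_pos = list(line).index(equivs[0])
    line.remove(gt)             # gt comes after equivs[0], so first_pos stays valid
    line.insert(first_pos, gt)
    return True

# Hypothetical line where the GT exists but is not the first TextEquiv.
line = ET.fromstring(
    f'<TextLine xmlns="{PC}" id="l1">'
    '<Coords points="0,0 10,0 10,10 0,10"/>'
    '<TextEquiv index="1"><Unicode>pred</Unicode></TextEquiv>'
    '<TextEquiv index="0"><Unicode>gt</Unicode></TextEquiv>'
    "</TextLine>"
)

moved = move_index0_to_front(line)
order = [te.findtext("pc:Unicode", namespaces=NS) for te in line.findall("pc:TextEquiv", NS)]
print(moved, order)  # True ['gt', 'pred']
```

Calling the function a second time on the same line returns False, since the @index=0 element is already first.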


