move edited/GT TextEquiv to front
Unfortunately, since PAGE-XML completely underspecifies how TextEquiv (with or without @index) is used and for what, applications have to define their own convention. IIUC (please correct me if I’m wrong):
LAREX convention
Existing TextEquivs are kept unchanged. Existing @index=0 is treated as GT. Anything else is treated as prediction, and only the highest position/index is shown.
When manual edits are done, GT is updated or created.
When saving, GT (if available) will become @index=0 and prediction (if available) @index=1. These two will be appended to any existing TextEquiv.
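The read side of that convention could be sketched roughly as follows. This is only an illustration of my understanding above, not LAREX code: the function names are made up, the 2019-07-15 PAGE namespace is assumed, and ranking predictions by @index first and document position second is a guess at what "highest position/index" means.

```python
import xml.etree.ElementTree as ET

# PAGE namespace (2019-07-15 schema version assumed; adjust to your files).
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}


def split_gt_and_prediction(line):
    """Return (gt, prediction) TextEquiv elements of a TextLine; either may be None.

    Hypothetical helper: @index=0 counts as GT, everything else as prediction.
    Among predictions, the highest @index wins; ties and missing @index fall
    back to document position (an assumption, see the lead-in).
    """
    gt, prediction, best_rank = None, None, None
    for position, equiv in enumerate(line.findall("pc:TextEquiv", NS)):
        index = equiv.get("index")
        if index == "0":
            if gt is None:  # keep the first index-0 entry if there are several
                gt = equiv
            continue
        rank = (int(index) if index is not None else -1, position)
        if best_rank is None or rank > best_rank:
            best_rank, prediction = rank, equiv
    return gt, prediction


def text_of(equiv):
    """Return the Unicode text of a TextEquiv, or None if there is no TextEquiv."""
    if equiv is None:
        return None
    return equiv.findtext("pc:Unicode", default="", namespaces=NS)
```

Note that document order plays no role in picking GT here, which is exactly where this differs from readers that just take the first TextEquiv.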
PageViewer
PageViewer only shows the first index as tooltip (regardless of @index).
Aletheia
Aletheia only shows the first index as tooltip (regardless of @index) and does allow editing multiple TextEquivs, but does not set (or even show) their @index. (It just calls them Variant1, Variant2 and so on in the GUI.)
@tboenig please correct me if this is not true for the fully licensed version.
OCR-D convention
The current spec says that where multiple TextEquivs are available, @index=1 should be preferred.
However, that’s not at all what is currently implemented across OCR-D: processors read the first TextEquiv (regardless of @index) and write starting at @index=0.
(Reason for this behaviour is probably that it’s easier to implement and “works” with PageViewer, and the concrete spec language on that matter came too late… So either we change the spec or we fix the implementation now. @kba?)
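To make the mismatch concrete, here is a small sketch of the two read strategies side by side. Both function names are hypothetical, the namespace version is assumed, and the fallback to the first TextEquiv when no @index=1 exists is my own assumption, not something taken from the spec.

```python
import xml.etree.ElementTree as ET

# PAGE namespace (2019-07-15 schema version assumed).
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}


def read_first(line):
    """What processors reportedly do: take the first TextEquiv, ignoring @index."""
    return line.find("pc:TextEquiv", NS)


def read_index1_preferred(line):
    """The spec as read above: prefer @index=1 when several TextEquivs exist.

    Falling back to the first TextEquiv is an assumption of this sketch.
    """
    equivs = line.findall("pc:TextEquiv", NS)
    for equiv in equivs:
        if equiv.get("index") == "1":
            return equiv
    return equivs[0] if equivs else None


# A line as LAREX would save it under the convention described earlier:
# an older TextEquiv first, then GT (@index=0) and prediction (@index=1) appended.
LAREX_SAVED = ET.fromstring(
    '<TextLine xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">'
    '<TextEquiv><Unicode>old OCR</Unicode></TextEquiv>'
    '<TextEquiv index="0"><Unicode>corrected GT</Unicode></TextEquiv>'
    '<TextEquiv index="1"><Unicode>new OCR</Unicode></TextEquiv>'
    '</TextLine>'
)
```

On such a line neither strategy returns the GT: the first picks up the old result, the second the new prediction.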
Solution
To become interoperable with OCR-D, it would currently suffice to just insert the new TextEquiv elements in front of the existing ones (while keeping all the @index rules).
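A minimal sketch of what "insert in front" could look like when writing, using Python's standard ElementTree. The helper names are hypothetical and the namespace version is assumed. If a line has no TextEquiv yet, the sketch simply appends at the end and does not check the child order the PAGE schema expects.

```python
import xml.etree.ElementTree as ET

# PAGE namespace (2019-07-15 schema version assumed).
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
TEXTEQUIV = f"{{{PAGE_NS}}}TextEquiv"
UNICODE = f"{{{PAGE_NS}}}Unicode"


def make_textequiv(text, index):
    """Build a TextEquiv element with the given @index and Unicode content."""
    equiv = ET.Element(TEXTEQUIV, {"index": str(index)})
    ET.SubElement(equiv, UNICODE).text = text
    return equiv


def insert_in_front(line, new_equivs):
    """Insert new_equivs (in order) before the first existing TextEquiv of line.

    Existing TextEquivs stay untouched and merely move further back.
    Without any existing TextEquiv the new ones are appended at the end
    (simplification: schema child order is not verified here).
    """
    children = list(line)
    position = next(
        (i for i, child in enumerate(children) if child.tag == TEXTEQUIV),
        len(children),
    )
    for offset, equiv in enumerate(new_equivs):
        line.insert(position + offset, equiv)
```

Called as `insert_in_front(line, [make_textequiv(gt, 0), make_textequiv(prediction, 1)])`, this keeps the @index rules and still lets a reader that only takes the first TextEquiv see the GT.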
Top GitHub Comments
(Another possibility, suggested by @stefanCCS privately, would be adding one alto:TextLine/alto:String/alto:Glyph per character of pc:TextLine/pc:TextEquiv/pc:Unicode – with pseudo-coordinates, since we usually do not have word or glyph segmentation available. These Glyphs could then carry Variants naturally. But while Variants of different Glyphs are usually independent of each other, here we would have to give them a special interpretation which prevents mixing/recombining local variants – like first glyph first variant with second glyph second variant. Again, there would always be the danger of being confused with actual Glyphs and actual local variants.)
The former. (I don’t know how LAREX behaves if multiple index0 versions preexist. But whatever index0 it picks up should be the one that OCR-D will see. Therefore, if index0 existed but was not first, LAREX should move it to the fore.)
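Moving a preexisting index0 to the fore could be sketched like this. Again a hypothetical helper with an assumed namespace version; when several index0 entries exist it just takes the first one in document order, which is an arbitrary choice since I don't know what LAREX actually picks.

```python
import xml.etree.ElementTree as ET

# PAGE namespace (2019-07-15 schema version assumed).
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
TEXTEQUIV = f"{{{PAGE_NS}}}TextEquiv"


def move_index0_to_front(line):
    """Move the first TextEquiv with @index=0 before all other TextEquivs.

    Returns True if the element order changed, False otherwise.
    """
    equivs = line.findall(TEXTEQUIV)
    gt = next((equiv for equiv in equivs if equiv.get("index") == "0"), None)
    if gt is None or gt is equivs[0]:
        return False  # nothing to move, or already in front
    # Position of the currently first TextEquiv among all children of the line.
    position = list(line).index(equivs[0])
    line.remove(gt)
    line.insert(position, gt)
    return True
```

Only the element order changes; the @index values and the other TextEquivs are left as they were.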