Recommended way of annotating line numbers
Hi @kermitt2
It might be good to clarify the recommended way of annotating line numbers.
I believe this is from one of the most recent conversations on Mattermost:
I think line numbers could be seen as an "insert", so as a note in the segmentation model
we could imagine adding a type attribute like type="line_number" because it's very specific
ideally it could be recognized as an independent block by pdfalto, instead of having it concatenated with each line of the text - then it would be easier to cover it in GROBID without any particular ad hoc processing, just with the segmentation model which would recognize the line number as a specific area
I believe you also referred to the Annotation guidelines for the ‘segmentation’ model.
The current <page> numbers seem to always be present at the top level (i.e. body or other elements close before it). Should the same be true for line numbers?
e.g.:
<note type="line_number">1</note>
<header>The first line</header>
<note type="line_number">2</note>
<header>The second line</header>
You also mentioned that some line numbers might get passed to the fulltext (or header) model and should be annotated there.
There it would be an issue if line numbers caused the current element to end, e.g. when the author name is broken over multiple lines (example: 10.1101/386813).
Should it be annotated there like this?
<docAuthor>First name<lb/>
<note type="other">1</note> Last name</docAuthor>
Or would this be wasted effort if line number handling is moved to pdfalto? I don’t really know the full scope of pdfalto. Certainly heuristics can be used to identify most line numbers (and that is what I am using to create the annotations). The position on the page is important. Sometimes there is only a small gap between the line number and the text, though. There may be outliers.
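To make the kind of heuristic I mean more concrete, here is a minimal sketch in Python. It is not what pdfalto or GROBID do, and it is simplified from any real annotation script: the `Token` structure, the function name `guess_line_numbers`, and the tolerance values are all hypothetical and chosen only for illustration. It assumes we already have the first token of each text line together with its position on the page, and it treats a token as a line number if it is purely numeric, sits at the leftmost numeric margin, and continues an increasing sequence.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """First token of a text line (hypothetical structure, not a pdfalto type)."""
    text: str
    x: float  # left edge of the token on the page
    y: float  # top edge of the token on the page


def guess_line_numbers(first_tokens: List[Token],
                       x_tolerance: float = 3.0,
                       min_count: int = 3) -> List[int]:
    """Return indices of first-of-line tokens that look like line numbers."""
    # Only plain ASCII digit tokens are candidates.
    numeric = [i for i, t in enumerate(first_tokens)
               if t.text.isascii() and t.text.isdigit()]
    if not numeric:
        return []
    # Assumption: line numbers share the leftmost margin among numeric tokens.
    margin_x = min(first_tokens[i].x for i in numeric)
    picked: List[int] = []
    previous = None
    for i in numeric:
        token = first_tokens[i]
        if abs(token.x - margin_x) > x_tolerance:
            continue  # numeric, but not aligned with the margin
        value = int(token.text)
        if previous is not None and value <= previous:
            continue  # does not continue the increasing sequence
        picked.append(i)
        previous = value
    # Too few hits is more likely noise than real line numbering.
    return picked if len(picked) >= min_count else []


page = [
    Token("bioRxiv", 72.0, 50.0),
    Token("1", 40.0, 100.0),
    Token("2", 40.0, 115.0),
    Token("3", 40.5, 130.0),
    Token("2018", 72.0, 145.0),  # a year starting a text line, not a line number
]
print(guess_line_numbers(page))  # [1, 2, 3]
```

A sketch like this does not address the cases mentioned above: it cannot separate a line number from text when the gap is very small and the two end up in one token, and a single numeric outlier at the margin can throw off the increasing-sequence check.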
Issue Analytics
- Created 4 years ago
- Comments: 6 (3 by maintainers)
Top GitHub Comments
Yes, let’s keep this issue open until the guidelines cover that.
Hello Daniel,
I plan to work on line numbers in April for the DataSeer project, in which I have booked some time for this. They will be handled first by pdfalto, basically by outputting line numbers as their own block (like a column). Then I will add guidelines for this in the segmentation model, and on how to handle them in other models if they unfortunately get through (basically as we currently do with any noise). There will also be an effort to produce dedicated training data.
The text/element flow from pdfalto will be different for line numbers than it is now, so we should not consider line numbers with the current pdfalto output (there is one training example for the segmentation model which has line numbers right now; it will be re-annotated). The current pdfalto/GROBID cannot handle line numbers; they should be considered unsupported and should not be annotated, because that would be just a hack. Normally there should be a good solution ready in May.