Recommended way of annotating line numbers


Hi @kermitt2

It might be good to clarify the recommended way of annotating line numbers.

I believe this is from one of the most recent conversations on Mattermost:

I think line number could be seen as an "insert", so as a note in the segmentation model
we could imagine adding a type attribute like type="line_number" because it's very specific
ideally it could be recognized as an independent block by pdfalto, instead of having it concatenated with each line of the text - then it would be easier to cover it in GROBID without any particular ad hoc processing, just with the segmentation model, which would recognize the line number as a specific area

I believe you also referred to the Annotation guidelines for the ‘segmentation’ model.

The current <page> numbers seem to always be present at the top level (i.e. the body or other elements are closed before them). Should the same be true for line numbers?

e.g.:

<note type="line_number">1</note>
<header>The first line</header>
<note type="line_number">2</note>
<header>The second line</header>

You also mentioned that some line numbers might get passed to the fulltext (or header) model and should be annotated there.

There it would be an issue if line numbers caused the current element to end, e.g. when the author name is broken over multiple lines (example: 10.1101/386813).

Should it be annotated there like this?

<docAuthor>First name<lb/>
<note type="other">1</note> Last name</docAuthor>

Or is this wasted effort if line number handling is moved to pdfalto? I don’t really know the full scope of pdfalto. Certainly heuristics can be used to identify most line numbers (and that is what I am using to create the annotations). The position on the page is important. Sometimes there is little gap between the line number and the text, though, and there may be outliers.
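To illustrate the kind of heuristic I mean, here is a minimal Python sketch. It is not the code I actually use and not anything from pdfalto or GROBID: the `Line` structure, the function name and the thresholds (`margin_x`, `min_gap`, `min_run`) are all made up for this example. It treats the first token of a line as a line number if it is purely numeric, sits in the left margin, is separated from the following text by a gap, and belongs to a run of increasing numbers.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    """One text line on a page (hypothetical structure for this sketch)."""
    tokens: List[str]
    x_left: List[float]   # left edge of each token
    x_right: List[float]  # right edge of each token


def find_line_numbers(lines, margin_x=60.0, min_gap=2.0, min_run=3):
    """Return indices of lines whose first token looks like a line number.

    All thresholds are assumptions and would need tuning per document.
    """
    candidates = []  # (line index, numeric value)
    for i, line in enumerate(lines):
        if not line.tokens:
            continue
        first = line.tokens[0]
        # Must be a plain ASCII integer.
        if not (first.isascii() and first.isdigit()):
            continue
        # Position on the page: must start inside the left margin.
        if line.x_left[0] > margin_x:
            continue
        # Must be separated from the following text by a visible gap.
        if len(line.tokens) > 1 and line.x_left[1] - line.x_right[0] < min_gap:
            continue
        candidates.append((i, int(first)))

    # Keep only runs of strictly increasing values that are long enough,
    # which filters out isolated numbers that happen to start a line.
    accepted, run = [], []
    for cand in candidates:
        if run and cand[1] <= run[-1][1]:
            if len(run) >= min_run:
                accepted.extend(run)
            run = []
        run.append(cand)
    if len(run) >= min_run:
        accepted.extend(run)
    return [i for i, _ in accepted]
```

The gap check is the weak point mentioned above: when there is little space between the line number and the text, `min_gap` would reject real line numbers, so such pages would still need manual review.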

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
kermitt2 commented, Mar 16, 2020

Yes, let’s keep this issue open until the guidelines cover that.

1 reaction
kermitt2 commented, Mar 16, 2020

Hello Daniel,

I plan to work on line numbers in April for the DataSeer project, in which I have booked some time for this. They will be handled first by pdfalto, basically outputting line numbers as their own block (like a column). Then I will add guidelines for this in the segmentation model and on how to handle them in other models if they unfortunately get through (basically as we currently do with any noise). There will also be an effort to produce dedicated training data.

The text/element flow from pdfalto will be different for line numbers than it is now, so we should not annotate line numbers based on the current pdfalto output (there is currently one training example for the segmentation model that has line numbers; it will be re-annotated). The current pdfalto/GROBID cannot handle line numbers: they should be considered unsupported and left unannotated, because annotating them now would just be a hack. Normally there should be a good solution ready in May.
