Issue with sentence segmentation offsets

In the following example

https://arxiv.org/pdf/2103.12028v1.pdf

there are cases of wrong sentence segmentation, with sentence offsets apparently shifted by a few characters, so that words are cut in the middle. This happens with either sentence segmenter, OpenNLP or Pragmatic Segmenter:

<s>Human annotators evaluated the quality of document alignments for six languages (de, zh, ar, ro, et, my) selected for their different scripts and amount of retrieved documents, reporting precision of over 90%. T</s>
<s>e quality of the extracted parallel sentences is evaluated in a machine translation (MT) task on six European...</s> 
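
A shift like the one in the output above can be detected mechanically by checking whether a sentence boundary falls between two alphanumeric characters. The following is a minimal sketch, not part of the tool discussed here: the function name `find_word_cuts`, the half-open `(start, end)` span representation, and the sample text are assumptions made for illustration.

```python
def find_word_cuts(text, spans):
    """Return the (start, end) spans whose start or end falls inside a word.

    Spans are assumed to be half-open character offsets into `text`.
    """
    def splits_word(pos):
        # A boundary cuts a word when the characters on both sides of it
        # are alphanumeric. Boundaries at the text edges never cut a word.
        return 0 < pos < len(text) and text[pos - 1].isalnum() and text[pos].isalnum()

    return [(start, end) for start, end in spans
            if splits_word(start) or splits_word(end)]


# Hypothetical sample text, loosely modelled on the example in this issue.
text = "reporting precision of over 90%. The quality of the extracted sentences is evaluated."

# Correct boundaries: the first sentence ends right after the period.
good = [(0, 32), (33, len(text))]
# Boundaries shifted by two characters: the first sentence swallows the "T".
shifted = [(0, 34), (34, len(text))]

print(find_word_cuts(text, good))
print(find_word_cuts(text, shifted))
```

A check of this kind only flags the symptom. It says nothing about which step of the pipeline introduced the shift.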

Since it happens with both segmenters, which use different offset calculation methods, it might be due to character encoding issues.
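
To illustrate how an encoding mismatch could shift offsets by a few characters, here is a small sketch of one possible mechanism: one component counts offsets in Unicode code points while another counts UTF-16 code units, so every character outside the Basic Multilingual Plane adds one unit of drift. This is an assumption for illustration only. The issue does not confirm that this is what happens here, and `utf16_offset` and the sample text are hypothetical.

```python
def utf16_offset(text, codepoint_offset):
    """Convert an offset counted in code points to one counted in UTF-16 code units."""
    prefix = text[:codepoint_offset]
    # Each UTF-16 code unit is 2 bytes; characters outside the BMP
    # are encoded as a surrogate pair, i.e. two code units.
    return len(prefix.encode("utf-16-le")) // 2


# Hypothetical text containing one non-BMP character
# (U+1D465, MATHEMATICAL ITALIC SMALL X).
text = "Score \U0001D465 over 90%. The quality"

codepoint_pos = text.index("The")          # offset in code points
utf16_pos = utf16_offset(text, codepoint_pos)  # offset in UTF-16 code units

# The two offsets differ by one for each non-BMP character before the position.
print(codepoint_pos, utf16_pos)
```

If offsets computed in one convention are applied to a string indexed in the other, every sentence after such a character would start and end one position off.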

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 7 (2 by maintainers)

Top GitHub Comments

lfoppiano commented, Apr 29, 2021

I haven’t tested it, but I think this could be fixed by PR #701.

lfoppiano commented, Jul 28, 2022

Since I did some swimming in this part of the code, I checked again with a fresh mind.

It seems that footnote 1 (superscript=True) triggers the if statement at line 234, which increases the upper limit of the sentence. Maybe we should just check that such a token is not in the reference list?
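
The behaviour described in this comment can be sketched roughly as follows. This is one reading of the suggestion and not the project's actual code: the `Token` class, `adjust_sentence_end`, and the `reference_spans` set are all hypothetical names. The idea is that a superscript token directly after the sentence end extends the sentence only when it is a known reference marker, so a plain footnote number leaves the boundary alone.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    start: int            # inclusive character offset
    end: int              # exclusive character offset
    superscript: bool = False


def adjust_sentence_end(sentence_end, tokens, reference_spans):
    """Extend sentence_end over trailing superscript reference markers only.

    reference_spans is a set of (start, end) offsets of known reference markers.
    """
    end = sentence_end
    for token in sorted(tokens, key=lambda t: t.start):
        if token.start < end:
            continue  # token already lies inside the sentence
        is_reference = (token.start, token.end) in reference_spans
        if token.start == end and token.superscript and is_reference:
            end = token.end  # absorb the reference marker into the sentence
        else:
            break  # anything else leaves the boundary untouched
    return end


# Hypothetical layout: "over 90%." followed by a superscript footnote "1".
tokens = [
    Token("90%.", 5, 9),
    Token("1", 9, 10, superscript=True),
    Token("The", 11, 14),
]

# Footnote marker that is not a known reference: boundary stays at 9.
print(adjust_sentence_end(9, tokens, reference_spans=set()))
# Same token registered as a reference marker: boundary moves to 10.
print(adjust_sentence_end(9, tokens, reference_spans={(9, 10)}))
```

Whether the real check should include or exclude tokens from the reference list depends on how that list is built in the code base, which the comment does not spell out.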
