Issue with sentence segmentation offsets

In the following example

https://arxiv.org/pdf/2103.12028v1.pdf

there are cases of wrong sentence segmentation: the sentence offsets appear to be shifted by a few characters, so words are cut at the sentence boundary. This happens whichever sentence segmenter is selected, OpenNLP or Pragmatic Segmenter:

<s>Human annotators evaluated the quality of document alignments for six languages (de, zh, ar, ro, et, my) selected for their different scripts and amount of retrieved documents, reporting precision of over 90%. T</s>
<s>e quality of the extracted parallel sentences is evaluated in a machine translation (MT) task on six European...</s>
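A cut like the one above can be spotted automatically. The following is a minimal sketch, not part of the tool discussed in the issue: it assumes the sentences are available as (start, end) character spans over the full text, and the function name find_cut_words and the sample text are hypothetical.

```python
def find_cut_words(text, spans):
    """Return the (start, end) spans whose start or end falls inside a word."""

    def inside_word(pos):
        # A boundary at `pos` splits a word when the characters on both
        # sides of it are alphanumeric.
        return 0 < pos < len(text) and text[pos - 1].isalnum() and text[pos].isalnum()

    return [(s, e) for s, e in spans if inside_word(s) or inside_word(e)]


# Hypothetical text and spans, shortened from the example in the issue.
text = "Precision was over 90%. The quality is evaluated in an MT task."
good = [(0, 23), (24, 63)]      # boundaries fall on whitespace or punctuation
shifted = [(0, 25), (26, 63)]   # boundaries shifted by two characters

print(find_cut_words(text, good))     # no span is flagged
print(find_cut_words(text, shifted))  # both spans are flagged
```

With the shifted spans, the first sentence ends in ". T" and the second starts with "e quality", which mirrors the reported output.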

Since it happens with both segmenters, which use different offset calculation methods, it might be due to issues with character encoding.
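To illustrate how an encoding-related mismatch could shift offsets by a few characters, here is a small sketch. It is only one possible mechanism, not a confirmed cause of this issue: it assumes the offsets are computed on a Unicode-normalized copy of the text and then applied to the original string. The sample text is hypothetical.

```python
import unicodedata

# Hypothetical input: "\ufb01" is the single-character "fi" ligature,
# which is common in text extracted from PDFs.
original = "The \ufb01rst \ufb01le is \ufb01ne. The quality is evaluated."

# NFKC normalization expands each ligature into two characters ("f" + "i"),
# so the normalized string is longer than the original.
normalized = unicodedata.normalize("NFKC", original)
drift = len(normalized) - len(original)

# End offset (exclusive) of the first sentence, computed on the normalized text.
end = normalized.index(". ") + 1

sentence_ok = normalized[:end]      # offset applied to the string it came from
sentence_shifted = original[:end]   # same offset applied to the original string

print(drift)
print(repr(sentence_ok))
print(repr(sentence_shifted))
```

Applied to the original string, the offset runs past the real sentence end and takes the first letters of the next word, one extra character per expanded ligature.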

Issue Analytics

Created 2 years ago
Comments: 7 (2 by maintainers)

Top GitHub Comments

lfoppiano commented, Apr 29, 2021

I haven’t tested it, but I think this could be fixed by PR #701.

lfoppiano commented, Jul 28, 2022

Since I’ve done some swimming in this part of the code, I’ve checked again with a fresh mind.

It seems that the footnote marker 1 (superscript=True) triggers the if at line 234, which increases the upper limit of the sentence. Maybe we should just check that such a token is not in the reference list?
