Issue with sentence segmentation offsets
In the following example
https://arxiv.org/pdf/2103.12028v1.pdf
there are cases of wrong sentence segmentation, with sentence offsets apparently shifted by a few characters, so that words are cut. This happens whichever sentence segmenter is selected, OpenNLP or Pragmatic Segmenter:
<s>Human annotators evaluated the quality of document alignments for six languages (de, zh, ar, ro, et, my) selected for their different scripts and amount of retrieved documents, reporting precision of over 90%. T</s>
<s>e quality of the extracted parallel sentences is evaluated in a machine translation (MT) task on six European...</s>
Since this happens with both segmenters, which use different offset calculation methods, it might be due to character encoding issues.
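To narrow down where the shift appears, a small check like the one below could flag sentence spans whose boundaries fall in the middle of a word. This is only a sketch in Python, not code from the project: the function name `find_split_words` and the `(start, end)` span format (end exclusive) are assumptions made for illustration, and the sample text is shortened from the example above.

```python
def find_split_words(text, spans):
    """Return the spans whose start or end offset falls inside a word.

    `text` is the full paragraph; `spans` is a list of (start, end)
    character offsets, end exclusive (an assumed format for this sketch).
    """
    def inside_word(pos):
        # A boundary is inside a word when the characters on both
        # sides of it are alphanumeric.
        return 0 < pos < len(text) and text[pos - 1].isalnum() and text[pos].isalnum()

    return [(s, e) for s, e in spans if inside_word(s) or inside_word(e)]


text = "reporting precision of over 90%. The quality of the extracted sentences is evaluated."

# Shifted offsets: the first sentence swallows the "T" of "The".
first_end = text.index("The") + 1
shifted = [(0, first_end), (first_end, len(text))]

# Expected offsets: the boundary sits on the space between the sentences.
correct = [(0, text.index("The") - 1), (text.index("The"), len(text))]

print(find_split_words(text, shifted))
print(find_split_words(text, correct))
```

Running such a check over the segmenter output for the whole document would show whether the shift starts at a particular token, for instance a footnote marker, rather than being spread evenly, which would argue against an encoding problem.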
Issue Analytics
- State:
- Created 2 years ago
- Comments:7 (2 by maintainers)
Top GitHub Comments
I haven’t tested, but I think this could be fixed by PR #701.
Since I’ve done some digging in this part of the code, I checked again with a fresh mind.
It seems that the footnote marker "1" (superscript=True) triggers the if statement at line 234, which increases the upper limit of the sentence. Maybe we should just check that such a token is not in the reference list?
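The suggested check could look roughly like the sketch below. It is written in Python for brevity and is not the project's code: `Token`, `adjusted_sentence_end`, and the representation of the reference list as a set of `(start, end)` offsets are all hypothetical, and the actual condition at line 234 may differ.

```python
from dataclasses import dataclass


@dataclass
class Token:
    # Hypothetical token type for this sketch.
    text: str
    start: int
    end: int
    superscript: bool = False


def adjusted_sentence_end(sentence_end, next_token, reference_markers):
    """Return the sentence end, extended over a trailing superscript token
    only when that token is not a known marker.

    `reference_markers` is an assumed set of (start, end) offsets.
    """
    if next_token is None or not next_token.superscript:
        return sentence_end
    if next_token.start != sentence_end:
        # The token does not directly follow the sentence.
        return sentence_end
    if (next_token.start, next_token.end) in reference_markers:
        # Proposed extra check: leave the upper limit untouched.
        return sentence_end
    return next_token.end


footnote = Token("1", 32, 33, superscript=True)
print(adjusted_sentence_end(32, footnote, {(32, 33)}))  # marker is listed
print(adjusted_sentence_end(32, footnote, set()))       # marker is not listed
```

With the extra check, a footnote marker that is already listed no longer moves the upper limit, so the offsets of the following sentence stay aligned with the text.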