Different segmentation with spaCy and when using pySBD directly

Firstly, thank you for this project. I was lucky to find it, and it is really useful.

I seem to have found a case where segmentation behaves differently when run within the spaCy pipeline than when using pySBD directly. I stumbled on it in my own text, where a sentence following a quoted sentence was lumped together with it. I looked through the Golden Rules and found this wasn’t expected, and then noticed that even with the text from one of your tests, it behaves differently in spaCy.

To reproduce, run these two bits of code:

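import spacy
from pysbd.utils import PySBDFactory

# Add the pySBD component to a blank English pipeline
nlp = spacy.blank('en')
nlp.add_pipe(PySBDFactory(nlp))
doc = nlp("She turned to him, \"This is great.\" She held the book out to show him.")
# Print each sentence found by the pipeline on its own line
for sent in doc.sents:
    print(str(sent).strip() + '\n')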

She turned to him, "This is great." She held the book out to show him.

import pysbd
text = "She turned to him, \"This is great.\" She held the book out to show him."
seg = pysbd.Segmenter(language="en", clean=False)
#print(seg.segment(text))
for sent in seg.segment(text):
    print(str(sent).strip() + '\n')

She turned to him, "This is great."

She held the book out to show him.

The second output is the desired one (based on the rules, at least).
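For anyone who wants to compare the two results programmatically, here is a minimal sketch that finds where two segmentations first diverge. It hard-codes the two outputs printed above rather than calling either library, and `first_divergence` and `same_text` are hypothetical helper names, not part of pySBD or spaCy.

```python
def first_divergence(a, b):
    """Return the index of the first sentence that differs, or None if the lists match."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x.strip() != y.strip():
            return i
    # One list is a prefix of the other: they diverge where the shorter one ends
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


def same_text(a, b):
    """Check that both segmentations cover the same text, ignoring surrounding whitespace."""
    return " ".join(s.strip() for s in a) == " ".join(s.strip() for s in b)


# Outputs copied from the two runs in this issue
spacy_sents = ['She turned to him, "This is great." She held the book out to show him.']
direct_sents = ['She turned to him, "This is great."', 'She held the book out to show him.']

print(same_text(spacy_sents, direct_sents))
print(first_divergence(spacy_sents, direct_sents))
```

In a real check, the two lists would come from `doc.sents` and `seg.segment(text)` respectively. Confirming that both cover the same text first helps separate a boundary difference from a difference in cleaning or whitespace handling.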

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
nipunsadvilkar commented, Feb 12, 2020

@jenojp Have a look at a new issue which I just created. The solution there might give proper segmentation both with and without spaCy.

1 reaction
jenojp commented, Jan 29, 2020

@nipunsadvilkar I’ll keep you posted if I can get some free time to look into it more. This is a really promising project!
