Forcing sentence segmentation for newlines


Is there a way to force sentence segmentation when a newline \n character is found?

For example,

Hey Honnibal,

This is a great library for 2 reasons:
 - It's fast
 - It's accurate

This is parsed as 1 sentence using nlp(text). However, it’d be great if it were parsed as 4 sentences, one per non-empty line, because of the newlines. E-mail data tends to be in this format.
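To make the desired result concrete, here is a minimal sketch in plain Python that treats every non-empty line as its own segment. It does not use the library at all, and the helper name `split_on_newlines` is made up for illustration.

```python
from typing import List


def split_on_newlines(text: str) -> List[str]:
    """Return every non-empty line of `text` as its own segment."""
    # splitlines() breaks on newlines; blank lines are dropped.
    return [line.strip() for line in text.splitlines() if line.strip()]


email = (
    "Hey Honnibal,\n"
    "\n"
    "This is a great library for 2 reasons:\n"
    " - It's fast\n"
    " - It's accurate\n"
)

segments = split_on_newlines(email)
for segment in segments:
    print(segment)
```

This yields the 4 segments described above, but it only splits the raw string; it says nothing about how the parser would treat those boundaries.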

I’ve searched the docs but couldn’t find anything. Is there a workaround?

Thanks!

Issue Analytics

  • State: closed
  • Created 8 years ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

2 reactions
honnibal commented, May 5, 2016

I think it’s weird at the moment that there’s really no way to set the sentence boundaries yourself. If we had some way to do that, then users could set the attribute before parsing. Then we just have to make sure the transition system respects the existing data in the token.

The tricky thing is that writing to token.sent_start might leave the parse in an invalid state, and I’m not sure I like the idea of quietly fixing up a bunch of other words’ .dep, .head etc attributes because you wrote to one token’s sent_start. A doc.break_sent(i) method is probably more sensible. On the other hand, if doc.is_parsed == False, then writing the sentence boundary should be unproblematic.

We should also have a SENT_START attribute. It’s pretty weird that it’s missing — it means there’s no way to export the sentence boundaries into the array at the moment. We would have to decide how to handle it in doc.from_array though.

So, here’s my proposal. It’s designed for now not to break anything, and to be better than the current status quo:

  • Add a SENT_START attribute.
  • Raise an error if both SENT_START and HEAD are being set at the same time in doc.from_array.
  • Allow writing to token.sent_start. Raise if self.doc.is_parsed == True.
  • Debate how to relax the constraint that you can only break after parsing.
  • Consider changing the features in the tagger, so that the N1 and N2 token features take EOL values when they hit a sentence boundary. This will allow users to make sure that the tagger behaves just as if the string ends there.
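The write guard and the from_array check in this proposal can be sketched with a toy model. This is only an illustration of the proposed behaviour, not the library's code: `ToyDoc`, `ToyToken` and `check_array_attrs` are hypothetical names, and the error type is an assumption.

```python
class ToyToken:
    """Hypothetical stand-in for a token; not the library's Token class."""

    def __init__(self, doc, text):
        self.doc = doc
        self.text = text
        self._sent_start = False

    @property
    def sent_start(self):
        return self._sent_start

    @sent_start.setter
    def sent_start(self, value):
        # Proposed rule: refuse the write once a parse exists, instead of
        # quietly rewriting other tokens' head/dep values.
        if self.doc.is_parsed:
            raise ValueError("cannot set sent_start on a parsed document")
        self._sent_start = bool(value)


class ToyDoc:
    """Hypothetical stand-in for a document; not the library's Doc class."""

    def __init__(self, words):
        self.is_parsed = False
        self.tokens = [ToyToken(self, word) for word in words]


def check_array_attrs(attrs):
    # Proposed rule: SENT_START and HEAD must not be set together.
    if "SENT_START" in attrs and "HEAD" in attrs:
        raise ValueError("SENT_START and HEAD cannot both be set")


doc = ToyDoc(["Hey", "Honnibal", ",", "This", "is", "great"])
doc.tokens[3].sent_start = True  # allowed: the document is not parsed yet

doc.is_parsed = True
try:
    doc.tokens[1].sent_start = True
    rejected = False
except ValueError:
    rejected = True  # the guard blocked the write after parsing
```

Raising on the write keeps the parse consistent by construction, at the cost of forcing users to set boundaries before parsing.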

If we do it this way, users will be able to insert their own SBD process into the pipeline, after tokenization or tagging, but before parsing. They would also be able to calculate their own parse and sentence boundaries after parsing, and set this information onto the document. However, it won’t be convenient — to do this, they would have to set is_parsed = False, set the token.sent_start attributes, and then assign the parse. It’ll be an annoying recipe, but it’ll work, and it’ll be self-contained.
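As a rough sketch of where a user-supplied SBD step would sit, the toy pipeline below runs tokenization, then a newline-based boundary pass, then a stand-in "parse" stage that respects the boundaries already set. Everything here is plain Python with hypothetical names (`tokenize`, `newline_sbd`, `group_sentences`); the regex tokenizer and the dict-based tokens are assumptions, not the library's data structures.

```python
import re


def tokenize(text):
    # Toy tokenizer: newline runs become their own tokens, other
    # whitespace is dropped, and everything else splits on whitespace.
    return [{"text": piece, "sent_start": False}
            for piece in re.findall(r"\n+|\S+", text)]


def newline_sbd(tokens):
    # Custom SBD step: the first real token after a newline starts a sentence.
    start_next = True
    for token in tokens:
        if "\n" in token["text"]:
            start_next = True
        elif start_next:
            token["sent_start"] = True
            start_next = False
    return tokens


def group_sentences(tokens):
    # Stand-in for the parser: it only respects the boundaries already set.
    sentences = []
    for token in tokens:
        if "\n" in token["text"]:
            continue
        if token["sent_start"] or not sentences:
            sentences.append([])
        sentences[-1].append(token["text"])
    return [" ".join(words) for words in sentences]


email = (
    "Hey Honnibal,\n\n"
    "This is a great library for 2 reasons:\n"
    " - It's fast\n"
    " - It's accurate"
)

# The SBD step runs after tokenization and before the "parse" stage.
result = group_sentences(newline_sbd(tokenize(email)))
for sentence in result:
    print(sentence)
```

The point of the ordering is that the last stage never has to guess boundaries: it reads whatever the earlier step wrote.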

0 reactions
lock[bot] commented, May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.


