Forcing sentence segmentation for newlines
Is there a way to force sentence segmentation when a newline (\n) character is found?
For example,
Hey Honnibal,
This is a great library for 2 reasons:
- It's fast
- It's accurate
This is parsed as one sentence using nlp(text). However, it would be great if it were parsed as four sentences, splitting on the newlines, since e-mail data tends to be in this format.
I’ve searched the docs but couldn’t find anything. Is there a workaround?
Thanks!
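For reference, one workaround in later spaCy releases (v3.x, not available when this issue was filed) is a custom pipeline component that sets token.is_sent_start before the parser runs. A minimal sketch, assuming the en_core_web_sm model is installed; "newline_sentencizer" is an illustrative name, not a built-in component:

```python
import spacy
from spacy.language import Language

# "newline_sentencizer" is an illustrative component name, not a spaCy built-in.
@Language.component("newline_sentencizer")
def newline_sentencizer(doc):
    # Force a sentence start on the token that follows any newline token.
    # The parser runs afterwards and respects these pre-set boundaries,
    # though it may still add further boundaries of its own.
    for token in doc[:-1]:
        if "\n" in token.text:
            doc[token.i + 1].is_sent_start = True
    return doc

nlp = spacy.load("en_core_web_sm")  # assumes this model is installed
nlp.add_pipe("newline_sentencizer", before="parser")

text = "Hey Honnibal,\nThis is a great library for 2 reasons:\n- It's fast\n- It's accurate"
doc = nlp(text)
for sent in doc.sents:
    print(repr(sent.text))
```

Placing the component before the parser matters: the parser treats boundaries set by earlier components as fixed, so every line of the e-mail starts its own sentence.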
Top GitHub Comments
I think it’s weird at the moment that there’s really no way to set the sentence boundaries yourself. If we had some way to do that, then users could set the attribute before parsing. Then we just have to make sure the transition system respects the existing data in the token.
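As it turned out, later spaCy releases did add ways to set the boundaries yourself: token.is_sent_start became writable, and the Doc constructor accepts a sent_starts argument. A minimal sketch of the latter, using an invented toy token list:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Toy token list; True marks a sentence start, False forbids one.
words = ["Hey", "Honnibal", ",", "\n", "This", "is", "a", "great", "library", "."]
sent_starts = [True, False, False, False, True, False, False, False, False, False]

doc = Doc(nlp.vocab, words=words, sent_starts=sent_starts)
print([sent.text for sent in doc.sents])
```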
The tricky thing is that writing to token.sent_start might leave the parse in an invalid state, and I'm not sure I like the idea of quietly fixing up a bunch of other words' .dep, .head etc. attributes because you wrote to one token's sent_start. A doc.break_sent(i) method is probably more sensible. On the other hand, if doc.is_parsed == False, then writing the sentence boundary should be unproblematic.
We should also have a SENT_START attribute. It's pretty weird that it's missing; it means there's no way to export the sentence boundaries into the array at the moment. We would have to decide how to handle it in doc.from_array, though.
So, here's my proposal. It's designed for now not to break anything, and to be better than the current status quo:
- Add a SENT_START attribute.
- Decide what happens when SENT_START and HEAD are being set at the same time in doc.from_array.
- Allow writing to token.sent_start. Raise if self.doc.is_parsed == True.
- Have the N1 and N2 token features take EOL values when they hit a sentence boundary. This will allow users to make sure that the tagger behaves just as if the string ends there.
If we do it this way, users will be able to insert their own SBD process into the pipeline, after tokenization or tagging, but before parsing. They would also be able to calculate their own parse and sentence boundaries after parsing, and set this information onto the document. However, it won't be convenient: to do this, they would have to set is_parsed = False, set the token.sent_start attributes, and then assign the parse. It'll be an annoying recipe, but it'll work, and it'll be self-contained.