Forcing sentence segmentation for newlines
Is there a way to force sentence segmentation when a newline (\n) character is found?
For example,
Hey Honnibal,
This is a great library for 2 reasons:
- It's fast
- It's accurate
This is parsed as one sentence using nlp(text). However, it would be great if it were parsed as four sentences, splitting on the newlines, since e-mail data tends to be in this format.
I’ve searched the docs but couldn’t find anything. Is there a workaround?
Thanks!
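For reference, one workaround in later spaCy releases (v3.x, not available when this issue was filed) is a custom pipeline component that sets token.is_sent_start before the parser runs. A minimal sketch, assuming the en_core_web_sm model is installed; "newline_sentencizer" is an illustrative name, not a built-in component:

```python
import spacy
from spacy.language import Language

# "newline_sentencizer" is an illustrative component name, not a spaCy built-in.
@Language.component("newline_sentencizer")
def newline_sentencizer(doc):
    # Force a sentence start on the token that follows any newline token.
    # The parser runs afterwards and respects these pre-set boundaries,
    # though it may still add further boundaries of its own.
    for token in doc[:-1]:
        if "\n" in token.text:
            doc[token.i + 1].is_sent_start = True
    return doc

nlp = spacy.load("en_core_web_sm")  # assumes this model is installed
nlp.add_pipe("newline_sentencizer", before="parser")

text = "Hey Honnibal,\nThis is a great library for 2 reasons:\n- It's fast\n- It's accurate"
doc = nlp(text)
for sent in doc.sents:
    print(repr(sent.text))
```

Placing the component before the parser matters: the parser treats boundaries set by earlier components as fixed, so every line of the e-mail starts its own sentence.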
Top GitHub Comments
I think it’s weird at the moment that there’s really no way to set the sentence boundaries yourself. If we had some way to do that, then users could set the attribute before parsing. Then we just have to make sure the transition system respects the existing data in the token.
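As it turned out, later spaCy releases did add ways to set the boundaries yourself: token.is_sent_start became writable, and the Doc constructor accepts a sent_starts argument. A minimal sketch of the latter, using an invented toy token list:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Toy token list; True marks a sentence start, False forbids one.
words = ["Hey", "Honnibal", ",", "\n", "This", "is", "a", "great", "library", "."]
sent_starts = [True, False, False, False, True, False, False, False, False, False]

doc = Doc(nlp.vocab, words=words, sent_starts=sent_starts)
print([sent.text for sent in doc.sents])
```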
The tricky thing is that writing to token.sent_start might leave the parse in an invalid state, and I'm not sure I like the idea of quietly fixing up a bunch of other words' .dep, .head etc. attributes because you wrote to one token's sent_start. A doc.break_sent(i) method is probably more sensible. On the other hand, if doc.is_parsed == False, then writing the sentence boundary should be unproblematic.
We should also have a SENT_START attribute. It's pretty weird that it's missing; it means there's no way to export the sentence boundaries into the array at the moment. We would have to decide how to handle it in doc.from_array, though.
So, here's my proposal. It's designed for now not to break anything, and to be better than the current status quo:
- Add a SENT_START attribute.
- Decide what happens when SENT_START and HEAD are being set at the same time in doc.from_array.
- Allow writing to token.sent_start. Raise if self.doc.is_parsed == True.
- Have the N1 and N2 token features take EOL values when they hit a sentence boundary. This will allow users to make sure that the tagger behaves just as if the string ends there.
If we do it this way, users will be able to insert their own SBD process into the pipeline, after tokenization or tagging, but before parsing. They would also be able to calculate their own parse and sentence boundaries after parsing, and set this information onto the document. However, it won't be convenient: to do this, they would have to set is_parsed = False, set the token.sent_start attributes, and then assign the parse. It'll be an annoying recipe, but it'll work, and it'll be self-contained.