Segmenting sentences at colons

For example, the following snippet is extracted as one single sentence (ending at the last full stop), but it should perhaps be split at the colons.

Here they “warn” anyone who opposes his radical ideology:
Four police officers were sent to hospital:
Violence against police officers is not only acceptable with Bernie Sanders and Black Lives Matter terrorists, its necessary to create chaos and panic:
What kind of violent protest would be complete without Barack Obama’s good friend, domestic terrorist Bill Ayers:
It’s probably just a coincidence that on a day that Obama was too busy to attend Nancy Reagan’s funeral, he was able to address a crowd about his hate for Trump only hours before this organized chaos in Chicago:
And finally, we’re wondering how much our Organizer In Chief had to do with this Alinsky style chaos in Chicago:
Illegal aliens, paid Soros protesters, angry Black Lives Matter terrorists inspired by Obama’s race war and Bernie Sanders supporters who have absolutely no idea why they showed up, sent four innocent police officers to the hospital; prevented thousands of innocent Americans from exercising their First Amendment right.

Is this intentional? Is there a way to force splitting at colons? Besides this extreme example, I think I have come across many cases where syntok did not split at colons.
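
One possible workaround, sketched here under the assumption that each sentence returned by the segmenter is first joined back into a plain string, is to post-process it and break after every colon that is followed by whitespace. The helper `split_at_colons` is hypothetical and not part of syntok; it uses only the standard library and does not reproduce syntok's own segmentation rules.

```python
import re
from typing import List

# Hypothetical helper, not part of syntok: break an already-segmented
# sentence string after every colon that is followed by whitespace.
# A colon with no whitespace after it (e.g. "10:30") is left alone.
_COLON_BREAK = re.compile(r"(?<=:)\s+")


def split_at_colons(sentence: str) -> List[str]:
    """Return the pieces of `sentence`, split after each colon + whitespace."""
    pieces = _COLON_BREAK.split(sentence.strip())
    # Drop empty strings that could result from stray whitespace.
    return [piece for piece in pieces if piece]


if __name__ == "__main__":
    text = (
        "Four police officers were sent to hospital:\n"
        "And finally, we wonder who organized this: "
        "Thousands were prevented from attending."
    )
    for piece in split_at_colons(text):
        print(piece)
```

Because the pattern requires whitespace after the colon, times and ratios such as `10:30` stay intact, but a colon that introduces a list inside one sentence would still be split, so this is only a rough heuristic.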

Issue Analytics

  • State: open
  • Created 4 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
fnl commented, Apr 28, 2020

Release 1.3.1 now supports semi-colon segmentation.

I will leave this ticket open, however, as this was specifically about segmenting colons.

1 reaction
fnl commented, Jan 22, 2020

Thank you, Felix, for bringing this up; this is a valid feature request. Colon (and semi-colon) handling is indeed a bit of a borderline affair, and technically they are sentence separators. It might make sense to support that, but I need to think about it a bit more. I’d also love to hear feedback and opinions from other users about this.

[Correcting the title and adding labels.]
