Question: Is it possible to train a SentenceDetector() to ignore abbreviations?

Link to doc page in question (if any): I am using Spark NLP for a German NER pipeline but am having trouble detecting items that contain an abbreviation. For example, take a sentence like ‘Ich lebe in 89192 München Bavariastr. 34 und bin hier gern zu Hause.’ (the address is invented).

I would like to extract a zip-code-city-street style address such as ‘89192 München Bavariastr. 34’, but SentenceDetector() breaks the text into two sentences:

‘Ich lebe in 89192 München Bavariastr.’
‘34 und bin hier gern zu Hause.’

This (most likely) makes it rather hard to detect the intended string as an address.
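To make the failure mode concrete, here is a minimal pure-Python sketch. It is not Spark NLP's implementation; the function names and the `("str.",)` suffix list are hypothetical and chosen only for illustration. It shows how a rule that splits after every period reproduces the two-sentence result above, and how checking a list of known abbreviation endings keeps the address in one sentence.

```python
import re


def naive_split(text):
    """Split after every period that is followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=\.)\s+", text) if s.strip()]


def abbreviation_aware_split(text, suffixes=("str.",)):
    """Like naive_split, but rejoin a piece with the previous one when the
    previous piece ends in a known abbreviation suffix (hypothetical list)."""
    sentences = []
    for piece in naive_split(text):
        if sentences and sentences[-1].lower().endswith(suffixes):
            # The period belonged to an abbreviation, not a sentence end.
            sentences[-1] = sentences[-1] + " " + piece
        else:
            sentences.append(piece)
    return sentences


if __name__ == "__main__":
    text = "Ich lebe in 89192 München Bavariastr. 34 und bin hier gern zu Hause."
    print(naive_split(text))               # two pieces, split after "Bavariastr."
    print(abbreviation_aware_split(text))  # one piece, the address stays intact
```

A fixed suffix list like this only covers abbreviations known in advance, which is presumably why a trainable detector is being asked for here.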

Name of the Spark NLP feature whose docs need improvement: SentenceDetector()

What you think the docs should say: It would be great if there were something along the lines of what I found for NLTK's PunktTrainer or OpenNLP. Even if there is no way of training and applying a custom SentenceDetector() (which would surprise me), it would be good to have a statement about that limitation.

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 9 (4 by maintainers)

Top GitHub Comments

2 reactions
Dirkster99 commented, Aug 6, 2021

Just to circle back on this for one last time:

Training the SentenceDetectorDLModel on those special abbreviations as explained above has improved the efficiency of my NER pipeline. The percentage of correctly extracted entries jumped from 93.5% to 99.6%, so I am able to extract text almost error-free 😃

It’s amazing to see how easily this works once you know how it’s done 😃 I cannot share the actual data or results, but I’ll be looking for a public German text to maybe write an article on this, since I can imagine that many others face similar issues. Anyway, thanks for your quick feedback, it really is very helpful 👍🏽 🥇

1 reaction
Dirkster99 commented, Aug 4, 2021

Just for completeness’ sake: I found my error. My last code snippet had a typo: I was using the SentenceDetector class to load a model for the SentenceDetectorDLModel class from disk (now that I’ve seen the error, I understand what you meant by the rather generic phrase ‘You cannot load one into another’).

Applying the model as I downloaded it from the Models Hub did not resolve the sentence-splitting problem 😦 But I’ll try to train it and see whether that resolves my problem 😃

Thanks a lot for your quick support and this great library 😃

Top Results From Across the Web

Annotators - Spark NLP
Model annotators have a pretrained() on it's static object, ... SentenceDetector, Annotator that detects sentence boundaries using regular ...
python - How to avoid NLTK's sentence tokenizer splitting on ...
I think lower case for u.s.a in abbreviations list will work fine for you Try this, from nltk.tokenize.punkt import PunktSentenceTokenizer, ...
Sentence Splitting and the Scribendi Accelerator
Abbreviations are also an open-ended set, and a sentence splitter must be able to recognize domain-specific abbreviations when they occur ...
Customizing the SentenceDetector in Spark NLP | by Dirk Bahle
In this part of the post we are interested in finding sentences that contain abbreviations so we can show them to the model...
Apache OpenNLP Developer Documentation
Training options often include number of iterations, cutoff, abbreviations dictionary or something else. Sometimes it is possible to provide these options ...
