Question: Is it possible to train a SentenceDetector() to ignore abbreviations?

Link to doc page in question (if any):

I am using Spark NLP for a German NER pipeline but am having trouble detecting items that contain an abbreviation. For example, take a sentence like ‘Ich lebe in 89192 München Bavariastr. 34 und bin hier gern zu Hause.’ (the address is invented).

I would like to extract a zip-code-city-street style address such as ‘89192 München Bavariastr. 34’, but SentenceDetector() breaks the text into two sentences at the period after the abbreviation:

‘Ich lebe in 89192 München Bavariastr.’
‘34 und bin hier gern zu Hause.’

This most likely makes it rather hard to detect the intended string as an address, because the address is now spread across two sentences.
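To make the problem concrete, here is a minimal pure-Python sketch of the general idea behind abbreviation-aware splitting. It is not how Spark NLP's SentenceDetector() works internally; the function names and the `ABBREVIATION_SUFFIXES` tuple are assumptions made up for illustration.

```python
import re

# Hypothetical suffix list for illustration; a real list would be domain-specific.
ABBREVIATION_SUFFIXES = ("str.", "nr.", "bzw.")


def naive_split(text):
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def split_with_abbreviations(text, suffixes=ABBREVIATION_SUFFIXES):
    """Split like naive_split, but never right after a known abbreviation."""
    sentences = []
    current = []
    for token in text.split():
        current.append(token)
        is_abbreviation = token.lower().endswith(suffixes)
        if token[-1] in ".!?" and not is_abbreviation:
            sentences.append(" ".join(current))
            current = []
    if current:
        # Keep trailing text that has no closing punctuation.
        sentences.append(" ".join(current))
    return sentences


text = "Ich lebe in 89192 München Bavariastr. 34 und bin hier gern zu Hause."
print(naive_split(text))               # two pieces, split after 'Bavariastr.'
print(split_with_abbreviations(text))  # one piece, the address stays intact
```

A fixed suffix list only covers abbreviations known in advance, which is why a trainable detector is attractive for domain-specific text.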

Name of the Spark NLP feature whose docs need improvement: SentenceDetector()

What you think the docs should say: It would be great if the docs described something along the lines of what I found for the NLTK PunktTrainer or OpenNLP, that is, a way to train a custom sentence detector. Even if there is no way of training and applying a custom SentenceDetector() (which would surprise me), it would be good to have a statement about that limitation.

Issue Analytics

State:
Created 2 years ago
Comments: 9 (4 by maintainers)

Top GitHub Comments

Dirkster99 commented, Aug 6, 2021

Just to circle back on this for one last time:

Training the SentenceDetectorDLModel on those special abbreviations as explained above has improved the efficiency of my NER pipeline. The percentage of correctly extracted entries jumped from 93.5% to 99.6%, so I am able to extract text almost error-free 😃

It's amazing to see how easily this works once you know how it's done 😃 I cannot share the actual data results, but I’ll be looking for a public German text to maybe write an article on this, since I can imagine that many others face similar issues. Anyway, thanks for your quick feedback, it really is very helpful 👍🏽 🥇

Dirkster99 commented, Aug 4, 2021

Just for completeness' sake: I found my error. My last code snippet has a typo: I was using the SentenceDetector class to load a model for the SentenceDetectorDLModel class from disk (now that I’ve seen the error, I understand what you meant by the rather generic phrase ‘You cannot load one into another’).
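A minimal sketch of the corrected load, assuming the Spark NLP Python API (`sparknlp.annotator`) and a model directory that was previously saved to disk. The function name `load_dl_sentence_detector` and the column names "document" and "sentence" are illustrative assumptions, not taken from the original snippet.

```python
def load_dl_sentence_detector(model_path):
    """Load a saved deep-learning sentence detector with its own class.

    `model_path` is a hypothetical directory holding a saved
    SentenceDetectorDLModel.
    """
    # Imported inside the function so the sketch can be read and imported
    # without a Spark installation; calling it requires pyspark and spark-nlp.
    from sparknlp.annotator import SentenceDetectorDLModel

    # The typo described above was the equivalent of
    #     SentenceDetector.load(model_path)
    # SentenceDetector is the rule-based annotator, so it cannot read a
    # model that was saved by the DL annotator.
    return (
        SentenceDetectorDLModel.load(model_path)
        .setInputCols(["document"])
        .setOutputCol("sentence")
    )
```

The returned annotator would then take the place of SentenceDetector() as a pipeline stage after the DocumentAssembler.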

Applying the model as I downloaded it from the Models Hub did not resolve the sentence-splitting problem 😦 But I’ll try to train it and see whether that resolves my problem 😃

Thanks a lot for your quick support and this great library 😃
