Question: Is it possible to train a SentenceDetector() to ignore abbreviations?
Link to doc page in question (if any):
I am using Spark NLP for a German NER pipeline but am having trouble detecting items that contain an abbreviation. E.g., given a sentence like ‘Ich lebe in 89192 München Bavariastr. 34 und bin hier gern zu Hause.’ (the address is invented), I would like to extract a zip-code-city-street style address like ‘89192 München Bavariastr. 34’, but the SentenceDetector() breaks this into 2 sentences:
‘Ich lebe in 89192 München Bavariastr.’
‘34 und bin hier gern zu Hause.’
which (most likely) makes it rather hard to detect the intended string as an address.
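To make the failure mode concrete, here is a minimal pure-Python sketch. It is not Spark NLP code and does not reflect how SentenceDetector() is implemented; the abbreviation list and both function names are hypothetical and exist only for illustration. It shows how a splitter that breaks after every period cuts the address in two, and how a check against known abbreviations keeps it together.

```python
import re

# Hypothetical abbreviation list for illustration; a real one would be longer.
ABBREVIATIONS = ("str.", "nr.", "z.b.", "dr.")


def naive_split(text):
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return [piece for piece in re.split(r"(?<=[.!?])\s+", text) if piece]


def abbreviation_aware_split(text, abbreviations=ABBREVIATIONS):
    """Like naive_split, but re-join pieces that end in a known abbreviation.

    A suffix match is used because German compounds such as 'Bavariastr.'
    carry the abbreviation at the end of a longer token.
    """
    sentences = []
    carry = ""
    for piece in naive_split(text):
        candidate = f"{carry} {piece}" if carry else piece
        last_token = candidate.split()[-1].lower()
        if last_token.endswith(abbreviations):
            # The period belongs to an abbreviation, so keep collecting.
            carry = candidate
        else:
            sentences.append(candidate)
            carry = ""
    if carry:
        sentences.append(carry)
    return sentences


text = "Ich lebe in 89192 München Bavariastr. 34 und bin hier gern zu Hause."
print(naive_split(text))               # two pieces, the address is cut apart
print(abbreviation_aware_split(text))  # one piece, the address stays intact
```

A fixed list like this only covers abbreviations known in advance, which is why a trainable detector is the more general answer to the question.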
Name of the Spark NLP feature whose docs need improvement:
SentenceDetector()
What you think the docs should say:
It would be great if there was something along the lines of what I found for NLTK PunktTrainer or OpenNLP. Even if there is no way of training and applying a custom SentenceDetector() (which would surprise me), it would be good to have a statement on that limitation.
Issue Analytics
- Created 2 years ago
- Comments: 9 (4 by maintainers)
Just to circle back on this one last time: Training the SentenceDetectorDLModel on those special abbreviations as explained above has improved the efficiency of my NER pipeline. The percentage of correctly extracted entries jumped from 93.5% to 99.6%, so I am able to extract text almost error-free 😃 It's amazing to see how easily this works once you know how it's done 😃 I cannot share the actual data results, but I'll be looking for a public German text to maybe write an article on this, since I can imagine that many others face similar issues. Anyway, thanks for your quick feedback, it really is very helpful 👍🏽 🥇
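For readers who land here without the earlier comments, the training step might look roughly like the sketch below. It is not the exact code used in this thread. It assumes a running Spark session with Spark NLP installed and a plain-text training file with one sentence per line that includes the problematic abbreviations; the function names, the path arguments, and the epoch count are placeholders. The Spark NLP imports sit inside the functions so the snippet can be read and imported without a Spark installation.

```python
def train_sentence_detector(spark, training_file, model_dir, epochs=100):
    """Train a SentenceDetectorDL model on custom text and save it to disk.

    Assumption: training_file is plain text, one sentence per line.
    """
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import SentenceDetectorDLApproach

    training_data = spark.read.text(training_file).toDF("text")

    document_assembler = (
        DocumentAssembler()
        .setInputCol("text")
        .setOutputCol("document")
    )
    sentence_detector = (
        SentenceDetectorDLApproach()
        .setInputCols(["document"])
        .setOutputCol("sentences")
        .setEpochsNumber(epochs)
    )

    pipeline_model = Pipeline(
        stages=[document_assembler, sentence_detector]
    ).fit(training_data)

    # The last stage is the fitted SentenceDetectorDLModel.
    pipeline_model.stages[-1].write().overwrite().save(model_dir)


def load_sentence_detector(model_dir):
    """Load the saved model with the matching class, SentenceDetectorDLModel."""
    from sparknlp.annotator import SentenceDetectorDLModel

    return (
        SentenceDetectorDLModel.load(model_dir)
        .setInputCols(["document"])
        .setOutputCol("sentences")
    )
```

Loading goes through SentenceDetectorDLModel rather than SentenceDetector, since a model saved by one class cannot be loaded into the other.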
Just for completeness' sake: I found my error. My last code snippet has a typo: I am using the SentenceDetector class to load a model for the SentenceDetectorDLModel class from disk (now that I’ve seen the error, I understand what you meant by the rather generic phrase ‘You cannot load one into another’). The sentence splitting problem was not resolved by applying the model 😦 (as I downloaded it from the Models Hub). But I’ll try to train it and see if this could resolve my problem 😃
Thanks a lot for your quick support and this great library 😃