Beginning and end marker for JavaAnnotation are not correctly aligned with string for \n and double white-spaces
Description
There seems to be a character-alignment issue when extracting JavaAnnotation objects from strings containing special characters. SparkNLP seems to ignore duplicate white-spaces and newline characters during text analysis and extraction, so the reported begin/end offsets no longer line up with the original string.
Steps to Reproduce
- Start Spark with spark-nlp (i.e. bin/spark-shell --packages JohnSnowLabs:spark-nlp:1.6.3).
- Run the given script (see below).
- Notice that there are two JavaAnnotation objects.
- Notice that sentence start/end are not correctly aligned.
Not sure if this is a legitimate issue. It makes sense that one would try to cleanse the data a bit before running any analysis. However, little things like this could easily be overlooked and be hard to track down in a production environment. It may be best to keep the original character sequence, so that there is no ambiguity, or at least to make this auto-cleansing happen only via an optional parameter.
Script to run:
import com.johnsnowlabs.nlp.annotator._
import com.johnsnowlabs.nlp.annotators.ner.NerConverter
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.util.Benchmark
import org.apache.spark.ml.Pipeline
import org.apache.spark.sql.SparkSession
import spark.implicits._
spark.sparkContext.setLogLevel("WARN")
val document = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")
val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")
val token = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")
val posTagger = PerceptronModel.pretrained().setInputCols("token","sentence").setOutputCol("pos")
val nerTagger = NerDLModel.pretrained().setInputCols("token","document").setOutputCol("ner")
val nerConverter = new NerConverter().setInputCols("document", "token", "ner").setOutputCol("ner_converter")
val finisher = new Finisher().setInputCols("sentence","token","pos","ner_converter").setIncludeMetadata(true).setOutputAsArray(false).setCleanAnnotations(false).setAnnotationSplitSymbol("@").setValueSplitSymbol("#")
val pipeline = new Pipeline().setStages(Array(
  document,
  sentenceDetector,
  token,
  posTagger,
  nerTagger,
  nerConverter,
  finisher
))
import scala.collection.JavaConversions._
val model = pipeline.fit(Seq.empty[String].toDS.toDF("text"))
var lightModel = new LightPipeline(model)
// Annotate first, then read the "sentence" annotations from the result map.
var results = lightModel.fullAnnotateJava("I have a dog. He \n\n barks.")
var ner_results = results.get("sentence")
ner_results.foreach(println)
The result is shown below:
JavaAnnotation(document,0,12,I have a dog.,{})
JavaAnnotation(document,14,22,He barks.,{})
However, clearly, the first sentence should end at 13, and the next should end at 27. Also notice that the double spaces and newline characters have been cleansed out. This is evident for a sentence, but for entities and part-of-speech tags, where we only see a single word, the exclusion of special characters is harder to spot.
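To make the mismatch easier to see, here is a minimal sketch in plain Scala that checks whether an annotation's offsets actually point at its text in the original string, and re-locates the span if they do not. `Span`, `OffsetCheck`, `isAligned` and `realign` are hypothetical names made up for this illustration and are not part of Spark NLP; `Span` only mirrors the begin, end and result fields printed above. The sketch assumes that `end` is inclusive and that the only cleansing applied is collapsing whitespace runs. The exact numbers depend on the whitespace actually present in the input string.

```scala
import java.util.regex.Pattern

// Hypothetical stand-in for the begin/end/result fields printed by JavaAnnotation.
// This is NOT the Spark NLP class.
final case class Span(begin: Int, end: Int, result: String)

object OffsetCheck {
  // True when text[begin..end] (end inclusive, an assumption) equals the annotation text.
  def isAligned(text: String, s: Span): Boolean =
    s.begin >= 0 && s.begin <= s.end && s.end < text.length &&
      text.substring(s.begin, s.end + 1) == s.result

  // Re-locate the span in the original text: every whitespace run in the
  // annotation text may match one or more whitespace characters in the original.
  def realign(text: String, s: Span, from: Int = 0): Option[Span] = {
    val pattern = s.result.trim
      .split("\\s+")
      .map(Pattern.quote)
      .mkString("\\s+")
      .r
    pattern.findAllMatchIn(text)
      .find(_.start >= from)
      .map(m => Span(m.start, m.end - 1, m.matched))
  }

  def main(args: Array[String]): Unit = {
    val text = "I have a dog. He \n\n barks."
    val spans = Seq(Span(0, 12, "I have a dog."), Span(14, 22, "He barks."))
    spans.foreach { s =>
      println(s"$s aligned=${isAligned(text, s)} realigned=${realign(text, s, s.begin)}")
    }
  }
}
```

A check like this can be run over the output of the pipeline before offsets are used downstream (for example for highlighting), so that silently shifted positions are caught early rather than in production.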
Issue Analytics
- State:
- Created 5 years ago
- Reactions:2
- Comments:5 (2 by maintainers)

Also, if relevant, it might be worth backporting this fix so that other users can benefit from it.
This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 5 days