Beginning and end marker for JavaAnnotation are not correctly aligned with string for \n and double white-spaces

Description

There seems to be a character alignment issue when extracting JavaAnnotation objects from strings containing special characters. SparkNLP seems to ignore duplicate white-spaces, as well as newline characters, while doing text analysis and extraction.

Steps to Reproduce

Start Spark with spark-nlp (i.e. bin/spark-shell --packages JohnSnowLabs:spark-nlp:1.6.3).
Run the given script (see below).
Notice that there are two JavaAnnotation objects.
Notice that sentence start/end are not correctly aligned.

I'm not sure whether this is a legitimate issue. It makes sense that one would try to cleanse the data a bit before running any analysis. However, little things like this could easily be overlooked and be hard to track down in a production environment. It may be best to keep the original characters, so that there is no ambiguity, or at least to allow this auto-cleansing to happen only via an optional parameter.
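Until something like that exists, one possible workaround is to keep a map from cleansed offsets back to the original string. The sketch below is plain Scala and does not use any Spark NLP API. `OffsetMap` and `collapse` are hypothetical names, and it assumes the cleansing simply collapses each run of whitespace into a single space, which may not be exactly what the library does. The sample string is also an assumption, modeled on this report (two spaces after "I" and after "dog.", plus two newlines).

```scala
// Hypothetical helper, not part of Spark NLP.
object OffsetMap {
  /** Collapses every whitespace run to one space and records, for each
    * character of the collapsed text, its index in the original string. */
  def collapse(raw: String): (String, Vector[Int]) = {
    val sb = new StringBuilder
    val idx = Vector.newBuilder[Int]
    var prevWs = false
    for (i <- raw.indices) {
      val c = raw.charAt(i)
      if (c.isWhitespace) {
        // Keep only the first character of a whitespace run.
        if (!prevWs) { sb.append(' '); idx += i }
        prevWs = true
      } else {
        sb.append(c); idx += i
        prevWs = false
      }
    }
    (sb.toString, idx.result())
  }
}

// Assumed input: two spaces after "I" and after "dog.", two newlines.
val raw = "I  have a dog.  He \n\n barks."
val (clean, toRaw) = OffsetMap.collapse(raw)

println(clean)      // I have a dog. He barks.
println(toRaw(12))  // 13: end of the first sentence in the original text
println(toRaw(14))  // 16: start of the second sentence in the original text
println(toRaw(22))  // 27: end of the second sentence in the original text
```

With such a map, the begin and end of any annotation computed on the cleansed text can be translated back before it is compared with the raw input.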

Script to run:

import com.johnsnowlabs.nlp.annotator._
import com.johnsnowlabs.nlp.annotators.ner.NerConverter
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.util.Benchmark
import org.apache.spark.ml.Pipeline
import org.apache.spark.sql.SparkSession

import spark.implicits._
spark.sparkContext.setLogLevel("WARN")

val document = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val token = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val posTagger = PerceptronModel.pretrained().setInputCols("token","sentence").setOutputCol("pos")
val nerTagger = NerDLModel.pretrained().setInputCols("token","document").setOutputCol("ner")
val nerConverter = new NerConverter().setInputCols("document", "token", "ner").setOutputCol("ner_converter")
val finisher = new Finisher().setInputCols("sentence","token","pos","ner_converter").setIncludeMetadata(true).setOutputAsArray(false).setCleanAnnotations(false).setAnnotationSplitSymbol("@").setValueSplitSymbol("#")

val pipeline = new Pipeline().setStages(Array(
    document,
    sentenceDetector,
    token,
    posTagger,
    nerTagger,
    nerConverter,
    finisher
))

import scala.collection.JavaConversions._
val model = pipeline.fit(Seq.empty[String].toDS.toDF("text"))
val lightModel = new LightPipeline(model)
// Annotate first, then read the "sentence" annotations from the result map.
// The input has two spaces after "I" and after "dog.", plus two newlines.
val results = lightModel.fullAnnotateJava("I  have a dog.  He \n\n barks.")
val ner_results = results.get("sentence")
ner_results.foreach(println)

The result is below:

JavaAnnotation(document,0,12,I have a dog.,{})
JavaAnnotation(document,14,22,He barks.,{})

However, the first sentence should clearly end at 13, and the next should end at 27. Also notice that the double spaces and newline characters have been cleansed out. This is evident for a whole sentence, but for entities and parts of speech, where we only see a single word, the exclusion of special characters is harder to spot.
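The expected offsets can be checked without Spark. This is a minimal sketch in plain Scala, assuming the input has two spaces after "I" and after "dog." (which is what offsets 13 and 27 imply) and assuming the cleansing amounts to collapsing each whitespace run into one space. `span` is a hypothetical helper that returns inclusive, zero-based offsets.

```scala
// Assumed input: two spaces after "I" and after "dog.", two newlines.
val raw = "I  have a dog.  He \n\n barks."

// Hypothetical helper: inclusive, zero-based span that starts at the first
// occurrence of `first` and ends at the last character of `last`.
def span(text: String, first: String, last: String): (Int, Int) = {
  val begin = text.indexOf(first)
  val end = text.indexOf(last, begin) + last.length - 1
  (begin, end)
}

// Offsets against the original string.
val rawFirst = span(raw, "I", "dog.")
val rawSecond = span(raw, "He", "barks.")

// Offsets after collapsing whitespace runs (assumed cleansing behavior).
val collapsed = raw.replaceAll("\\s+", " ")
val cleanFirst = span(collapsed, "I", "dog.")
val cleanSecond = span(collapsed, "He", "barks.")

println(rawFirst)    // (0,13)
println(rawSecond)   // (16,27)
println(cleanFirst)  // (0,12)
println(cleanSecond) // (14,22)
```

Under these assumptions the collapsed offsets match the reported annotations (0 to 12 and 14 to 22), which is consistent with the positions being computed on the cleansed text rather than on the original input.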