Beginning and end marker for JavaAnnotation are not correctly aligned with string for \n and double white-spaces

Description

There seems to be a character alignment issue when extracting JavaAnnotation objects from strings containing special characters. SparkNLP appears to ignore duplicate white-spaces, as well as newline characters, while doing text analysis and extraction.

Steps to Reproduce

  1. Start Spark with spark-nlp (i.e. bin/spark-shell --packages JohnSnowLabs:spark-nlp:1.6.3).
  2. Run the given script (see below).
  3. Notice that there are two JavaAnnotation objects.
  4. Notice that sentence start/end are not correctly aligned.

I am not sure whether this is a legitimate issue. It makes sense to cleanse the data a bit before running any analysis. However, small things like this are easily overlooked and can be hard to track down in a production environment. It may be best to keep the original character list, so that there is no ambiguity, or at least to make this auto-cleansing an optional parameter.
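
One way to keep offsets unambiguous while still cleansing the text is to record, for every character of the cleaned string, the index it came from in the original. The sketch below is plain Scala and does not use Spark NLP. The helper names `collapseWithOffsets` and `toRaw` are hypothetical, the whitespace-collapsing rule is an assumption made for illustration (not necessarily what Spark NLP does internally), and the input assumes a double space after "I".

```scala
// Collapse each run of whitespace into a single space and remember, for every
// character kept, its index in the original text.
// Hypothetical helper for illustration only; not part of Spark NLP.
def collapseWithOffsets(text: String): (String, Vector[Int]) = {
  val out = new StringBuilder
  val indices = Vector.newBuilder[Int]
  var previousWasSpace = false
  for ((ch, i) <- text.zipWithIndex) {
    if (ch.isWhitespace) {
      // Keep only the first character of a whitespace run, as a plain space.
      if (!previousWasSpace) { out += ' '; indices += i }
      previousWasSpace = true
    } else {
      out += ch
      indices += i
      previousWasSpace = false
    }
  }
  (out.toString, indices.result())
}

// Same input as in the script below (assumed: two spaces after "I").
val rawText = "I  have a dog.  He \n\n barks."
val (cleanedText, origIndex) = collapseWithOffsets(rawText)

// Map inclusive (begin, end) offsets in the cleaned text back to the raw text.
def toRaw(begin: Int, end: Int): (Int, Int) = (origIndex(begin), origIndex(end))

println(cleanedText)     // I have a dog. He barks.
println(toRaw(0, 12))    // (0,13)
println(toRaw(14, 22))   // (16,27)
```

With such a map, offsets reported against the cleaned text (0 to 12 and 14 to 22 in the result further down) translate back to positions in the string that was actually submitted.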

Script to run:

import com.johnsnowlabs.nlp.annotator._
import com.johnsnowlabs.nlp.annotators.ner.NerConverter
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.util.Benchmark
import org.apache.spark.ml.Pipeline
import org.apache.spark.sql.SparkSession

import spark.implicits._
spark.sparkContext.setLogLevel("WARN")

val document = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val token = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val posTagger = PerceptronModel.pretrained().setInputCols("token","sentence").setOutputCol("pos")
val nerTagger = NerDLModel.pretrained().setInputCols("token","document").setOutputCol("ner")
val nerConverter = new NerConverter().setInputCols("document", "token", "ner").setOutputCol("ner_converter")
val finisher = new Finisher().setInputCols("sentence","token","pos","ner_converter").setIncludeMetadata(true).setOutputAsArray(false).setCleanAnnotations(false).setAnnotationSplitSymbol("@").setValueSplitSymbol("#")

val pipeline = new Pipeline().setStages(Array(
    document,
    sentenceDetector,
    token,
    posTagger,
    nerTagger,
    nerConverter,
    finisher
))

import scala.collection.JavaConversions._
val model = pipeline.fit(Seq.empty[String].toDS.toDF("text"))
val lightModel = new LightPipeline(model)
// Annotate first, then read the "sentence" annotations from the result map.
// Note the double spaces and the newline characters in the input.
val results = lightModel.fullAnnotateJava("I  have a dog.  He \n\n barks.")
val ner_results = results.get("sentence")
ner_results.foreach(println)

The result is below:

JavaAnnotation(document,0,12,I have a dog.,{})
JavaAnnotation(document,14,22,He barks.,{})

However, the first sentence should clearly end at 13, and the second at 27. Also notice that the double spaces and newline characters have been stripped out. This is evident for a whole sentence, but for entities and parts of speech, where only a single word is shown, the removal of special characters is harder to spot.
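
To make the expected numbers concrete, here is a minimal sketch in plain Scala (no Spark NLP involved) that locates both sentences in the raw input and in a whitespace-collapsed copy. The helper `span` is hypothetical, the input assumes a double space after "I" (which is what the offsets 13 and 27 correspond to), and collapsing whitespace with a regex is only an assumption about the kind of cleansing being applied.

```scala
// Raw input, assuming a double space after "I".
val raw = "I  have a dog.  He \n\n barks."

// Hypothetical helper: inclusive (begin, end) offsets of the text that runs
// from the first occurrence of `first` to the next occurrence of `last`.
def span(text: String, first: String, last: String): (Int, Int) = {
  val begin = text.indexOf(first)
  val end = text.indexOf(last, begin) + last.length - 1
  (begin, end)
}

// Offsets in the string that was actually submitted.
val firstSentence = span(raw, "I", "dog.")
val secondSentence = span(raw, "He", "barks.")
println(firstSentence)   // (0,13)
println(secondSentence)  // (16,27)

// Offsets after collapsing every whitespace run into one space
// (an assumed approximation of the cleansing, for comparison only).
val cleaned = raw.replaceAll("\\s+", " ")
println(span(cleaned, "I", "dog."))     // (0,12)
println(span(cleaned, "He", "barks."))  // (14,22)
```

The second pair of results matches the reported annotations, which suggests that the begin and end markers refer to the cleansed text rather than to the original string.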

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Reactions: 2
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

apiltamang commented, Sep 17, 2018 (2 reactions)

Also, if relevant, it might be worth backporting this fix so that other users are not affected by the discrepancy.

github-actions[bot] commented, Mar 9, 2022 (0 reactions)

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 5 days.
