
Tokenizer: Problems with compound words starting with '?'


I trained a PosModel with my own Tokenizer, which uses some compound words starting with special characters like '?', '$', or '-'. It fails with '?' at the prediction stage, but not at the training stage.
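The failure can be reproduced outside Spark NLP entirely. A minimal sketch of the underlying problem: `String.replaceAll` compiles its first argument as a regular expression, so a compound token whose first character is '?' is parsed as a quantifier with nothing to quantify.

```java
import java.util.regex.PatternSyntaxException;

public class DanglingMeta {
    public static void main(String[] args) {
        try {
            // replaceAll treats "?!" as a regex; a leading '?' is a
            // quantifier with no preceding element, so compilation fails.
            "Quoi ?! Vraiment ?!".replaceAll("?!", "<TOKEN>");
        } catch (PatternSyntaxException e) {
            System.out.println(e.getDescription());
        }
    }
}
```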

Description

Expected Behavior

No errors.

Current Behavior

With a compound word starting with '?', loading the model and annotating with a LightPipeline produces the following stack trace:

[Lorg.apache.spark.ml.Transformer;@2fa7a7d6
[error] (run-main-0) java.util.regex.PatternSyntaxException: Dangling meta character '?' near index 0
[error] ?!
[error] ^
[error] java.util.regex.PatternSyntaxException: Dangling meta character '?' near index 0
[error] ?!
[error] ^
[error] 	at java.util.regex.Pattern.error(Pattern.java:1955)
[error] 	at java.util.regex.Pattern.sequence(Pattern.java:2123)
[error] 	at java.util.regex.Pattern.expr(Pattern.java:1996)
[error] 	at java.util.regex.Pattern.compile(Pattern.java:1696)
[error] 	at java.util.regex.Pattern.<init>(Pattern.java:1351)
[error] 	at java.util.regex.Pattern.compile(Pattern.java:1028)
[error] 	at java.lang.String.replaceAll(String.java:2223)
[error] 	at com.johnsnowlabs.nlp.annotators.Tokenizer$$anonfun$tag$1$$anonfun$1$$anonfun$apply$9.apply(Tokenizer.scala:114)
[error] 	at com.johnsnowlabs.nlp.annotators.Tokenizer$$anonfun$tag$1$$anonfun$1$$anonfun$apply$9.apply(Tokenizer.scala:113)
[error] 	at scala.collection.IndexedSeqOptimized$class.foldr(IndexedSeqOptimized.scala:62)
[error] 	at scala.collection.IndexedSeqOptimized$class.foldRight(IndexedSeqOptimized.scala:70)
[error] 	at scala.collection.mutable.ArrayOps$ofRef.foldRight(ArrayOps.scala:186)
[error] 	at com.johnsnowlabs.nlp.annotators.Tokenizer$$anonfun$tag$1$$anonfun$1.apply(Tokenizer.scala:113)
[error] 	at com.johnsnowlabs.nlp.annotators.Tokenizer$$anonfun$tag$1$$anonfun$1.apply(Tokenizer.scala:113)
[error] 	at scala.Option.map(Option.scala:146)
[error] 	at com.johnsnowlabs.nlp.annotators.Tokenizer$$anonfun$tag$1.apply(Tokenizer.scala:113)
[error] 	at com.johnsnowlabs.nlp.annotators.Tokenizer$$anonfun$tag$1.apply(Tokenizer.scala:110)
[error] 	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
[error] 	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
[error] 	at scala.collection.immutable.List.foreach(List.scala:392)
[error] 	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
[error] 	at scala.collection.immutable.List.map(List.scala:296)
[error] 	at com.johnsnowlabs.nlp.annotators.Tokenizer.tag(Tokenizer.scala:110)
[error] 	at com.johnsnowlabs.nlp.annotators.Tokenizer.annotate(Tokenizer.scala:159)
[error] 	at com.johnsnowlabs.nlp.LightPipeline$$anonfun$fullAnnotate$1.apply(LightPipeline.scala:26)
[error] 	at com.johnsnowlabs.nlp.LightPipeline$$anonfun$fullAnnotate$1.apply(LightPipeline.scala:19)
[error] 	at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
[error] 	at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
[error] 	at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
[error] 	at com.johnsnowlabs.nlp.LightPipeline.fullAnnotate(LightPipeline.scala:19)
[error] 	at com.johnsnowlabs.nlp.LightPipeline.annotate(LightPipeline.scala:58)
[error] 	at com.johnsnowlabs.nlp.LightPipeline$$anonfun$annotate$2.apply(LightPipeline.scala:63)
[error] 	at com.johnsnowlabs.nlp.LightPipeline$$anonfun$annotate$2.apply(LightPipeline.scala:62)
[error] 	at scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:657)
[error] 	at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:49)
[error] 	at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
[error] 	at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
[error] 	at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:51)
[error] 	at scala.collection.parallel.mutable.ParArray$Map.tryLeaf(ParArray.scala:648)
[error] 	at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.internal(Tasks.scala:159)
[error] 	at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.internal(Tasks.scala:443)
[error] 	at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:149)
[error] 	at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:443)
[error] 	at scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
[error] 	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
[error] 	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
[error] 	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
[error] 	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Possible Solution

I think the compound words have to be regex-quoted before being passed to the Tokenizer, at both the training and prediction stages. This would be an enhancement of the Tokenizer.
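A minimal sketch of the proposed quoting, using `java.util.regex.Pattern.quote` (which wraps the token in `\Q...\E` so every character is matched literally):

```java
import java.util.regex.Pattern;

public class QuotedCompound {
    public static void main(String[] args) {
        String compound = "?!";
        // Quoting neutralizes '?', '$', '-' and any other regex
        // metacharacter in the compound token.
        String quoted = Pattern.quote(compound);
        String out = "Quoi ?! Vraiment ?!".replaceAll(quoted, "<COMPOUND>");
        System.out.println(out);  // Quoi <COMPOUND> Vraiment <COMPOUND>
    }
}
```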

Steps to Reproduce

  1. Create a Tokenizer with .setCompositeTokens(Arrays.asList("?!?!","!?"))
  2. Use it in a Pipeline stage, fit() it
  3. Create a LightPipeline with the model
  4. annotate() some text with the LightPipeline
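The steps above can be sketched as follows, assuming the Spark NLP 1.7.2 API (`DocumentAssembler`, `Tokenizer.setCompositeTokens`, `LightPipeline`); this is not runnable without a Spark session and the spark-nlp dependency, and the exact signatures may differ between releases:

```scala
import com.johnsnowlabs.nlp.{DocumentAssembler, LightPipeline}
import com.johnsnowlabs.nlp.annotators.Tokenizer
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

// Composite tokens starting with '?' trigger the bug.
val tokenizer = new Tokenizer()
  .setInputCols(Array("document"))
  .setOutputCol("token")
  .setCompositeTokens(Array("?!?!", "!?"))

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer))
val df = spark.createDataFrame(Seq(Tuple1("dummy"))).toDF("text")
val model = pipeline.fit(df)  // training succeeds

// Prediction fails: annotate() compiles the composite token as a
// regex and throws PatternSyntaxException.
new LightPipeline(model).annotate("Quoi ?! Vraiment ?!?!")
```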

Context

I have to use a specific list of compound words for my model (and in French, there are many ^^).

Your Environment

  • Scala + Spark + Spark NLP with SBT
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "com.johnsnowlabs.nlp" % "spark-nlp_2.11" % "1.7.2",
  "org.apache.spark" % "spark-core_2.11" % "2.3.2",
  "org.apache.spark" % "spark-mllib_2.11" % "2.3.2",
  "org.scalactic" %% "scalactic" % "3.0.5",
  "org.scalatest" %% "scalatest" % "3.0.5" % Test
)

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 7 (5 by maintainers)

Top GitHub Comments

1 reaction
albertoandreottiATgmail commented, Nov 8, 2018

Yep. I’d like to see if maybe you’re using the wrong tool for the job. Also, if you want more support, I invite you to join our Slack channel.

0 reactions
saif-ellafi commented, Nov 19, 2018

Closing. Release 1.7.3 included changes in param functions to indicate this is a regex.
