Tokenizer: problems with compound words starting with '?'
I trained a PosModel with my own Tokenizer, which uses compound words starting with special characters such as '?', '$', or '-'. With '?' it fails at the prediction stage, not the training stage.
Description
Expected Behavior
No errors.
Current Behavior
With a compound word starting with '?', loading the model and annotating with a LightPipeline produces the following stack trace:
[Lorg.apache.spark.ml.Transformer;@2fa7a7d6
[error] (run-main-0) java.util.regex.PatternSyntaxException: Dangling meta character '?' near index 0
[error] ?!
[error] ^
[error] java.util.regex.PatternSyntaxException: Dangling meta character '?' near index 0
[error] ?!
[error] ^
[error] at java.util.regex.Pattern.error(Pattern.java:1955)
[error] at java.util.regex.Pattern.sequence(Pattern.java:2123)
[error] at java.util.regex.Pattern.expr(Pattern.java:1996)
[error] at java.util.regex.Pattern.compile(Pattern.java:1696)
[error] at java.util.regex.Pattern.<init>(Pattern.java:1351)
[error] at java.util.regex.Pattern.compile(Pattern.java:1028)
[error] at java.lang.String.replaceAll(String.java:2223)
[error] at com.johnsnowlabs.nlp.annotators.Tokenizer$$anonfun$tag$1$$anonfun$1$$anonfun$apply$9.apply(Tokenizer.scala:114)
[error] at com.johnsnowlabs.nlp.annotators.Tokenizer$$anonfun$tag$1$$anonfun$1$$anonfun$apply$9.apply(Tokenizer.scala:113)
[error] at scala.collection.IndexedSeqOptimized$class.foldr(IndexedSeqOptimized.scala:62)
[error] at scala.collection.IndexedSeqOptimized$class.foldRight(IndexedSeqOptimized.scala:70)
[error] at scala.collection.mutable.ArrayOps$ofRef.foldRight(ArrayOps.scala:186)
[error] at com.johnsnowlabs.nlp.annotators.Tokenizer$$anonfun$tag$1$$anonfun$1.apply(Tokenizer.scala:113)
[error] at com.johnsnowlabs.nlp.annotators.Tokenizer$$anonfun$tag$1$$anonfun$1.apply(Tokenizer.scala:113)
[error] at scala.Option.map(Option.scala:146)
[error] at com.johnsnowlabs.nlp.annotators.Tokenizer$$anonfun$tag$1.apply(Tokenizer.scala:113)
[error] at com.johnsnowlabs.nlp.annotators.Tokenizer$$anonfun$tag$1.apply(Tokenizer.scala:110)
[error] at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
[error] at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
[error] at scala.collection.immutable.List.foreach(List.scala:392)
[error] at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
[error] at scala.collection.immutable.List.map(List.scala:296)
[error] at com.johnsnowlabs.nlp.annotators.Tokenizer.tag(Tokenizer.scala:110)
[error] at com.johnsnowlabs.nlp.annotators.Tokenizer.annotate(Tokenizer.scala:159)
[error] at com.johnsnowlabs.nlp.LightPipeline$$anonfun$fullAnnotate$1.apply(LightPipeline.scala:26)
[error] at com.johnsnowlabs.nlp.LightPipeline$$anonfun$fullAnnotate$1.apply(LightPipeline.scala:19)
[error] at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
[error] at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
[error] at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
[error] at com.johnsnowlabs.nlp.LightPipeline.fullAnnotate(LightPipeline.scala:19)
[error] at com.johnsnowlabs.nlp.LightPipeline.annotate(LightPipeline.scala:58)
[error] at com.johnsnowlabs.nlp.LightPipeline$$anonfun$annotate$2.apply(LightPipeline.scala:63)
[error] at com.johnsnowlabs.nlp.LightPipeline$$anonfun$annotate$2.apply(LightPipeline.scala:62)
[error] at scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:657)
[error] at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:49)
[error] at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
[error] at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
[error] at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:51)
[error] at scala.collection.parallel.mutable.ParArray$Map.tryLeaf(ParArray.scala:648)
[error] at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.internal(Tasks.scala:159)
[error] at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.internal(Tasks.scala:443)
[error] at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:149)
[error] at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:443)
[error] at scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
[error] at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
[error] at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
[error] at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
[error] at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
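The Tokenizer frames bottom out in `String.replaceAll`, which compiles its first argument as a regular expression, so an unescaped composite token like `?!` is an invalid pattern. A minimal sketch of the same failure outside Spark NLP:

```scala
import java.util.regex.PatternSyntaxException

// String.replaceAll compiles its first argument as a regex, so a
// composite token starting with '?' is a dangling quantifier.
val failed =
  try {
    "Quoi ?! Vraiment ?".replaceAll("?!", "TOKEN")
    false
  } catch {
    case _: PatternSyntaxException => true // "Dangling meta character '?' near index 0"
  }

println(s"unescaped '?!' throws PatternSyntaxException: $failed")
```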
Possible Solution
I think the compound words have to be quoted (regex-escaped) before the Tokenizer uses them in a regex, at both the training and prediction stages. This would be an enhancement to the Tokenizer.
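For illustration, a sketch of that quoting with a hypothetical helper (not the actual Tokenizer code), using `java.util.regex.Pattern.quote` so metacharacters such as `?` are matched literally:

```scala
import java.util.regex.Pattern

// Hypothetical helper: mark each composite token in the text, escaping
// the token with Pattern.quote (\Q...\E) so regex metacharacters like
// '?' are matched literally. A token containing '$' or '\' would also
// need Matcher.quoteReplacement on the replacement side.
def markComposites(text: String, composites: Seq[String]): String =
  composites.foldLeft(text) { (t, token) =>
    t.replaceAll(Pattern.quote(token), s"<$token>")
  }

val out = markComposites("Quoi ?! Encore !?", Seq("?!?!", "!?", "?!"))
println(out) // Quoi <?!> Encore <!?>
```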
Steps to Reproduce
- Create a Tokenizer with `.setCompositeTokens(Arrays.asList("?!?!","!?"))`
- Use it in a Pipeline stage and `fit()` it
- Create a LightPipeline with the model
- `annotate()` some text with the LightPipeline
Context
I have to use a specific list of compound words for my model (and in French, there are many… ^^).
Your Environment
- Scala + Spark + Spark NLP with SBT
scalaVersion := "2.11.12"
libraryDependencies ++= Seq(
"com.johnsnowlabs.nlp" % "spark-nlp_2.11" % "1.7.2",
"org.apache.spark" % "spark-core_2.11" % "2.3.2",
"org.apache.spark" % "spark-mllib_2.11" % "2.3.2",
"org.scalactic" %% "scalactic" % "3.0.5",
"org.scalatest" %% "scalatest" % "3.0.5" % Test
)
Issue Analytics
- State: Closed
- Created 5 years ago
- Comments: 7 (5 by maintainers)

Yep. I’d like to see if maybe you’re using the wrong tool for the job. Also, if you want more support, I invite you to join our Slack channel.
Closing. Release 1.7.3 included changes in param functions to indicate this is a regex.
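If the parameter is documented as a regex, the escaping moves to the caller's side. A sketch, assuming `setCompositeTokens` entries are compiled as patterns:

```scala
import java.util.regex.Pattern

// Assuming composite tokens are now interpreted as regexes, escape
// them on the caller side before handing them to the Tokenizer.
val raw     = Seq("?!?!", "!?")
val escaped = raw.map(Pattern.quote(_)) // e.g. "\\Q?!?!\\E"

// Each escaped entry now compiles without a PatternSyntaxException.
escaped.foreach(Pattern.compile(_))
println(escaped.mkString(", "))
```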