Pretrained Explain Document DL for English no longer seems to work in Scala

Description

I’m currently trying to use the Explain Document DL pipeline for English via PretrainedPipeline("explain_document_dl", "en"). However, when loading the model I receive:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 (TID 8) (10.0.0.32 executor driver): java.io.InvalidClassException: scala.collection.mutable.WrappedArray$ofRef; local class incompatible: stream classdesc serialVersionUID = 1028182004549731694, local class serialVersionUID = 3456489343829468865
        at java.base/java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:689)
        at java.base/java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:2012)
        at java.base/java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1862)
        at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2169)
        at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1679)
        at java.base/java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2464)
        at java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2358)
        at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2196)
        at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1679)
        at java.base/java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2464)
        at java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2358)
        at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2196)
        at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1679)
        at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:493)
        at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:451)
        at scala.collection.immutable.List$SerializationProxy.readObject(List.scala:527)
        at jdk.internal.reflect.GeneratedMethodAccessor29.invoke(Unknown Source)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at java.base/java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1175)
        at java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2325)
        at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2196)
        at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1679)
        at java.base/java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2464)
        at java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2358)
        at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2196)
        at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1679)
        at java.base/java.io.ObjectInputStream.readArray(ObjectInputStream.java:2102)
        at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
        at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:493)
        at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:451)
        at org.apache.spark.util.Utils$.deserialize(Utils.scala:133)
        at org.apache.spark.SparkContext.$anonfun$objectFile$2(SparkContext.scala:1395)
        at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
        at scala.collection.Iterator$SliceIterator.hasNext(Iterator.scala:268)
        at scala.collection.Iterator.foreach(Iterator.scala:943)
        at scala.collection.Iterator.foreach$(Iterator.scala:943)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
        at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
        at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
        at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
        at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
        at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
        at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)
        at scala.collection.AbstractIterator.to(Iterator.scala:1431)
        at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358)
        at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358)
        at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1431)
        at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:345)
        at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339)
        at scala.collection.AbstractIterator.toArray(Iterator.scala:1431)
        at org.apache.spark.rdd.RDD.$anonfun$take$2(RDD.scala:1449)
        at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2236)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)
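
For reference, the loading code is essentially the following (a minimal sketch; the sample sentence and the printed output column are illustrative additions, only the pipeline name and language come from this issue):

    import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

    // Download and load the pretrained English "explain_document_dl" pipeline;
    // the exception above is thrown while the pipeline is being loaded.
    val pipeline = PretrainedPipeline("explain_document_dl", lang = "en")

    // Once loading works, annotating a sentence returns a Map of annotation
    // columns (document, token, lemma, pos, ner, ...) to their values.
    val annotated = pipeline.annotate("The quick brown fox jumps over the lazy dog.")
    println(annotated("lemma"))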

Expected Behavior

I would expect the pipeline to work, since the Explain ML pipeline works, as do the Explain pipelines I’ve tried for other languages. Explain DL says it supports Spark NLP 3.0.0+, which is what I’m using.
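
For comparison, the ML variant loads fine using the same pattern (sketch):

    // Loads without error in the same environment (sketch for comparison).
    val mlPipeline = PretrainedPipeline("explain_document_ml", lang = "en")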

Current Behavior

(see description)

Possible Solution

It seems to be some kind of Scala/Spark version incompatibility; I think the pipeline could well work if it were retrained. I was going to train my own version of the explain pipeline, but couldn’t find any resources on training a lemmatization model (there doesn’t seem to be a standalone pre-trained one on the Models Hub).
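
In case it’s useful, the rough shape of training a standalone lemmatizer with Spark NLP’s Lemmatizer annotator is sketched below (the dictionary file name, its format, and the delimiters are assumptions for illustration, not something verified against an existing corpus):

    import com.johnsnowlabs.nlp.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.{Lemmatizer, Tokenizer}
    import org.apache.spark.ml.Pipeline

    // Assemble raw text into documents, tokenize, and train a Lemmatizer from
    // a plain-text lemma dictionary ("lemmas.txt" is a hypothetical file with
    // lines like "be -> is was were being been").
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")

    val tokenizer = new Tokenizer()
      .setInputCols(Array("document"))
      .setOutputCol("token")

    val lemmatizer = new Lemmatizer()
      .setInputCols(Array("token"))
      .setOutputCol("lemma")
      .setDictionary("lemmas.txt", "->", "\t")

    // Fitting this on a DataFrame with a "text" column trains the lemmatizer
    // stage; the fitted PipelineModel can then be saved and reused.
    val trainingPipeline = new Pipeline()
      .setStages(Array(documentAssembler, tokenizer, lemmatizer))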

Steps to Reproduce

I would imagine that setting up a new project with the library versions below and attempting to load the pipeline would produce the same error, since the error comes from inside Spark itself. I just haven’t had the time to set up a reproduction repo.

Context

I’m attempting to pretrain a pipeline that provides everything the DL pipeline does; however, I’m not sure how to obtain certain values (like lemmas) without using this pipeline.

Your Environment

  • Spark NLP version (sparknlp.version()): 3.1.3
  • Apache Spark version (spark.version): 3.1.2
  • Java version (java -version): 8
  • Scala version: 2.12.13
  • Setup and installation (PyPI, Conda, Maven, etc.): Bazel w/ Scala Rules
  • Operating System and version: macOS Big Sur 11.4
  • Link to your project (if any): N/A

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 11 (5 by maintainers)

Top GitHub Comments

1 reaction
Nickersoft commented, Jul 23, 2021

Yeah it’s very strange… I just tested on my repro repo with Scala 2.11.12 and 2.12.10 with Spark 3.1.2 and got the error both times. Downgrading to Spark 3.0.3 is the only thing that works (Spark 3.1.1 also throws the error).

I’m also curious if you could share where in the Spark docs it claims 2.11.12 and 2.12.10 are the officially supported Scala versions? All I could find is where it says Spark 3.0 is pre-built with Scala 2.12 and you need a 2.12.x version to run it:

(screenshot of the relevant Spark documentation page omitted)

1 reaction
Nickersoft commented, Jul 23, 2021

So I actually just did some experimenting – downgrading Spark to 3.0.3 from 3.1.2 in my linked repo works. Reading the release notes for Spark 3.1.1 and 3.1.2 though, I can’t seem to find anything that would imply backward incompatibility.
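
For what it’s worth, the combination that works looks roughly like this in sbt terms (the actual repo uses Bazel with Scala Rules, so these coordinates are purely illustrative):

    // build.sbt sketch of the working combination (illustrative only; the
    // linked repo is built with Bazel + Scala Rules, not sbt).
    scalaVersion := "2.12.10"

    libraryDependencies ++= Seq(
      "org.apache.spark"     %% "spark-mllib" % "3.0.3",
      "com.johnsnowlabs.nlp" %% "spark-nlp"   % "3.1.3"
    )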
