Pretrained Explain Document DL for English no longer seems to work in Scala
Description
I’m currently attempting to use the Explain Document DL pipeline for English via PretrainedPipeline("explain_document_dl", "en"). However, when attempting to load the pipeline I receive:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 (TID 8) (10.0.0.32 executor driver): java.io.InvalidClassException: scala.collection.mutable.WrappedArray$ofRef; local class incompatible: stream classdesc serialVersionUID = 1028182004549731694, local class serialVersionUID = 3456489343829468865
at java.base/java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:689)
at java.base/java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:2012)
at java.base/java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1862)
at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2169)
at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1679)
at java.base/java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2464)
at java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2358)
at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2196)
at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1679)
at java.base/java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2464)
at java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2358)
at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2196)
at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1679)
at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:493)
at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:451)
at scala.collection.immutable.List$SerializationProxy.readObject(List.scala:527)
at jdk.internal.reflect.GeneratedMethodAccessor29.invoke(Unknown Source)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at java.base/java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1175)
at java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2325)
at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2196)
at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1679)
at java.base/java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2464)
at java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2358)
at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2196)
at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1679)
at java.base/java.io.ObjectInputStream.readArray(ObjectInputStream.java:2102)
at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:493)
at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:451)
at org.apache.spark.util.Utils$.deserialize(Utils.scala:133)
at org.apache.spark.SparkContext.$anonfun$objectFile$2(SparkContext.scala:1395)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
at scala.collection.Iterator$SliceIterator.hasNext(Iterator.scala:268)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)
at scala.collection.AbstractIterator.to(Iterator.scala:1431)
at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358)
at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1431)
at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:345)
at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1431)
at org.apache.spark.rdd.RDD.$anonfun$take$2(RDD.scala:1449)
at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2236)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
Expected Behavior
I would expect the pipeline to load and work, given that the Explain Document ML pipeline works, as do the Explain pipelines I’ve tried for other languages. Explain Document DL is listed as supporting Spark NLP 3.0.0+, which is what I’m using.
Current Behavior
(see description)
Possible Solution
It seems to be some kind of Scala/Spark version incompatibility: the InvalidClassException on scala.collection.mutable.WrappedArray suggests the serialized pipeline and my classpath disagree on the class version. I think the pipeline could work if it were retrained and re-serialized against the current versions. I was going to train my own version of the explain pipeline, but I couldn’t find any resources on training a lemmatization model (there doesn’t seem to be a standalone pre-trained one on the Models Hub); my best guess at how that would look is sketched below.
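From what I can tell, the Lemmatizer annotator can be fit from a plain-text lemma dictionary via setDictionary, so training one would look roughly like the sketch below. This is untested on my side; the dictionary file name, its contents, and the delimiter arguments are placeholders rather than anything confirmed in this issue.

import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.{Lemmatizer, Tokenizer}
import org.apache.spark.ml.Pipeline
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("lemmatizer-training-sketch")
  .master("local[*]")
  .getOrCreate()

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

// Placeholder dictionary: each line pairs a lemma with its inflected forms,
// using the key and value delimiters passed to setDictionary.
val lemmatizer = new Lemmatizer()
  .setInputCols("token")
  .setOutputCol("lemma")
  .setDictionary("lemmas.txt", "->", "\t")

val pipeline = new Pipeline()
  .setStages(Array(documentAssembler, tokenizer, lemmatizer))

// The Lemmatizer trains from the dictionary, not from the DataFrame, so any
// DataFrame with a "text" column is enough to fit the pipeline.
val trainingData = spark.createDataFrame(Seq(Tuple1("placeholder text"))).toDF("text")
val lemmaPipelineModel = pipeline.fit(trainingData)
lemmaPipelineModel.write.overwrite().save("lemmatizer_pipeline_model")

The fitted lemma stage could then be reused inside a larger pipeline like the one described under Context below.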
Steps to Reproduce
I would imagine that setting up a new project with the library versions below and attempting to load the pipeline would result in the same error, since the error is coming from inside Spark’s deserialization itself. I just haven’t had the time to set up a reproduction repo, but the sketch below is essentially all it would contain.
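For completeness, the reproduction I have in mind is essentially just the following (assuming Spark 3.1.2 and Spark NLP 3.1.3 on the classpath; loading the pretrained pipeline is the step that throws):

import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import org.apache.spark.sql.SparkSession

object ExplainDocumentDlRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("explain-document-dl-repro")
      .master("local[*]")
      .getOrCreate()

    // This download/deserialization step is where the InvalidClassException above is thrown.
    val pipeline = PretrainedPipeline("explain_document_dl", lang = "en")

    // Never reached on my setup; included only to show the intended usage.
    val annotations = pipeline.annotate("John Snow Labs is based in Delaware.")
    println(annotations)

    spark.stop()
  }
}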
Context
I’m attempting to build and train a pipeline of my own that provides everything the DL pipeline does; however, I’m not sure how to obtain certain annotations (such as lemmas) without the use of this pipeline. A sketch of how I picture assembling such a pipeline follows.
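Concretely, the hand-assembled replacement I picture looks something like the sketch below. The POS model name is an assumption about what is available on the Models Hub, and the lemma stage is exactly the piece I’m missing (it could be the dictionary-trained Lemmatizer sketched above):

import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

// Assumed model name for the standard English POS tagger on the Models Hub.
val posTagger = PerceptronModel.pretrained("pos_anc", "en")
  .setInputCols("sentence", "token")
  .setOutputCol("pos")

// Missing here: a lemma stage comparable to the one bundled in explain_document_dl.
val customPipeline = new Pipeline()
  .setStages(Array(documentAssembler, sentenceDetector, tokenizer, posTagger))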
Your Environment
- Spark NLP version (sparknlp.version()): 3.1.3
- Apache Spark version (spark.version): 3.1.2
- Java version (java -version): 8
- Scala version: 2.12.13
- Setup and installation (Pypi, Conda, Maven, etc.): Bazel w/ Scala Rules
- Operating System and version: macOS Big Sur 11.4
- Link to your project (if any): N/A
Issue Analytics
- Created 2 years ago
- Comments: 11 (5 by maintainers)

Top GitHub Comments
Yeah, it’s very strange… I just tested in my repro repo with Scala 2.11.12 and 2.12.10 against Spark 3.1.2 and got the error both times. Downgrading to Spark 3.0.3 is the only thing that works (Spark 3.1.1 also throws the error).
I’m also curious whether you could share where in the Spark docs it claims 2.11.12 and 2.12.10 are the officially supported Scala versions? All I could find is where it says Spark 3.0 is pre-built with Scala 2.12 and that you need a compatible 2.12.x version to run it.
So I actually just did some experimenting: downgrading Spark from 3.1.2 to 3.0.3 in my linked repo works. Reading the release notes for Spark 3.1.1 and 3.1.2, though, I can’t seem to find anything that would imply a backwards-incompatible change. The dependency change that worked for me is sketched below.
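For anyone else hitting this, here is the working downgrade expressed as sbt-style coordinates; my project actually builds with Bazel and Scala Rules, so treat this purely as an illustrative sketch of the versions involved:

// Illustrative only: sbt equivalent of the Spark downgrade that avoided the InvalidClassException.
ThisBuild / scalaVersion := "2.12.13"

libraryDependencies ++= Seq(
  // Spark 3.1.1/3.1.2 trigger the error when loading explain_document_dl; 3.0.3 does not.
  "org.apache.spark"     %% "spark-core"  % "3.0.3",
  "org.apache.spark"     %% "spark-sql"   % "3.0.3",
  "org.apache.spark"     %% "spark-mllib" % "3.0.3",
  "com.johnsnowlabs.nlp" %% "spark-nlp"   % "3.1.3"
)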