Using kotlin-spark-api interactively in a Kotlin Jupyter notebook?
Hi,
Thank you for your work on the kotlin-spark-api. It looks very promising!
I’m trying to use kotlin-spark-api interactively in a Kotlin Jupyter notebook. A Spark task run in a notebook cell from within a withSpark function works fine:
withSpark {
    dsOf("a" to 1, "b" to 2, "c" to 3, "d" to 4)
        .filter { it.second <= 2 }
        .show()
}
The result of .show() doesn’t appear in the Jupyter cell, but I can see the correct output in the logs in the terminal:
+-----+------+
|first|second|
+-----+------+
| a| 1|
| b| 2|
+-----+------+
However, if I try to do the same task interactively in the notebook, I get an error when using filter with a lambda function:
val spark = SparkSession
    .builder()
    .master("local[2]")
    .appName("Simple Application")
    .orCreate

val ds = spark.dsOf("a" to 1, "b" to 2, "c" to 3, "d" to 4)
ds.count()                          // ==> 4
ds.filter { it.second <= 2 }.show() // ==> 💣
The full error message is:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 3, 192.168.1.202, executor driver): java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance of org.apache.spark.rdd.MapPartitionsRDD
at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2133)
at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1305)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2251)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
at scala.collection.immutable.List$SerializationProxy.readObject(List.scala:488)
at sun.reflect.GeneratedMethodAccessor25.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2136)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:115)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:83)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:444)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:447)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2023)
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:1972)
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:1971)
scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1971)
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:950)
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:950)
scala.Option.foreach(Option.scala:407)
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:950)
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2203)
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2152)
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2141)
org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:752)
org.apache.spark.SparkContext.runJob(SparkContext.scala:2093)
org.apache.spark.SparkContext.runJob(SparkContext.scala:2114)
org.apache.spark.SparkContext.runJob(SparkContext.scala:2133)
org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:467)
org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:420)
org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:47)
org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3625)
org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2695)
org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3616)
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
org.apache.spark.sql.Dataset.withAction(Dataset.scala:3614)
org.apache.spark.sql.Dataset.head(Dataset.scala:2695)
org.apache.spark.sql.Dataset.take(Dataset.scala:2902)
org.apache.spark.sql.Dataset.getRows(Dataset.scala:300)
org.apache.spark.sql.Dataset.showString(Dataset.scala:337)
org.apache.spark.sql.Dataset.show(Dataset.scala:824)
org.apache.spark.sql.Dataset.show(Dataset.scala:783)
org.apache.spark.sql.Dataset.show(Dataset.scala:792)
Line_22_jupyter.<init>(Line_22.jupyter.kts:1)
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
java.lang.reflect.Constructor.newInstance(Constructor.java:423)
kotlin.script.experimental.jvm.BasicJvmScriptEvaluator.evalWithConfigAndOtherScriptsResults(BasicJvmScriptEvaluator.kt:96)
kotlin.script.experimental.jvm.BasicJvmScriptEvaluator.invoke$suspendImpl(BasicJvmScriptEvaluator.kt:41)
kotlin.script.experimental.jvm.BasicJvmScriptEvaluator.invoke(BasicJvmScriptEvaluator.kt)
kotlin.script.experimental.jvm.BasicJvmReplEvaluator.eval(BasicJvmReplEvaluator.kt:51)
org.jetbrains.kotlin.jupyter.ReplForJupyterImpl$doEval$resultWithDiagnostics$1.invokeSuspend(repl.kt:525)
kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:56)
kotlinx.coroutines.EventLoopImplBase.processNextEvent(EventLoop.common.kt:274)
kotlinx.coroutines.BlockingCoroutine.joinBlocking(Builders.kt:84)
kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking(Builders.kt:59)
kotlinx.coroutines.BuildersKt.runBlocking(Unknown Source)
kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking$default(Builders.kt:38)
kotlinx.coroutines.BuildersKt.runBlocking$default(Unknown Source)
org.jetbrains.kotlin.jupyter.ReplForJupyterImpl.doEval(repl.kt:525)
org.jetbrains.kotlin.jupyter.ReplForJupyterImpl.eval(repl.kt:366)
org.jetbrains.kotlin.jupyter.ProtocolKt$shellMessagesHandler$res$1.invoke(protocol.kt:138)
org.jetbrains.kotlin.jupyter.ProtocolKt$shellMessagesHandler$res$1.invoke(protocol.kt)
org.jetbrains.kotlin.jupyter.ProtocolKt.evalWithIO(protocol.kt:351)
org.jetbrains.kotlin.jupyter.ProtocolKt.shellMessagesHandler(protocol.kt:137)
org.jetbrains.kotlin.jupyter.IkotlinKt.kernelServer(ikotlin.kt:107)
org.jetbrains.kotlin.jupyter.IkotlinKt.main(ikotlin.kt:69)
I found a report of a seemingly very similar error from someone trying to use Spark in an Almond Scala Jupyter notebook.
I’m just beginning to use Kotlin and the JVM, so I’m not sure why the version run with withSpark works but the interactive version does not.
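A possible workaround (a sketch only, not something suggested in this thread, assuming the ds defined above and the first/second column names visible in the show() output): expressing the predicate with Spark’s untyped Column API avoids serializing a JVM lambda to the executors entirely, which is the step that fails with the SerializedLambda ClassCastException.

import org.apache.spark.sql.functions.col

// Filtering with a Column expression is handled by Spark's query planner,
// so no REPL-compiled lambda has to be shipped to the executors.
ds.filter(col("second").leq(2)).show() // should print the same two rows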
Thank you!
Todd
Here are the details of my Jupyter notebook environment:
- macOS 10.15.6
- Java 1.8.0_144
- kotlin-jupyter-kernel 0.8.2.5
- kotlin-spark-api 0.3.0
- spark-sql_2.12 3.0.0
- jupyterlab 2.2.2
- KotlinVersion.CURRENT 1.4.20
Hi @ileasile!
Thank you! Your suggestion fixes my problem. 😄
I wasn’t using the %use spark magic because I wanted to try kotlin-spark-api. Perhaps we should add an additional library descriptor to kotlin-jupyter that is based on kotlin-spark-api?

Todd
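For reference, the two approaches look roughly like this as notebook cells (a sketch; the @file:DependsOn coordinates are a placeholder, not verified Maven coordinates):

// Option 1 (the fix suggested here): the %use magic loads the kernel's
// built-in Spark library descriptor, which sets up the dependencies and
// configuration the notebook needs.
%use spark

// Option 2 (sketch of the manual route taken above): pull kotlin-spark-api in
// directly with the kernel's dependency annotation; replace the placeholder
// with the real coordinates for your Spark/Scala versions.
@file:DependsOn("<kotlin-spark-api-group>:<kotlin-spark-api-artifact>:0.3.0")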
@ToddSmall yes) I’ve already asked @asm0dey to provide one.