
Using kotlin-spark-api interactively in a Kotlin Jupyter notebook?


Hi,

Thank you for your work on the kotlin-spark-api. It looks very promising!

I’m trying to use the kotlin-spark-api interactively in a Kotlin Jupyter notebook. A Spark task run in a notebook cell from within a withSpark function works fine:

withSpark {
    dsOf("a" to 1, "b" to 2, "c" to 3, "d" to 4)
        .filter { it.second <= 2 }
        .show()
}

The result of .show() doesn’t appear in the Jupyter cell, but I can see the correct output in the terminal logs:

+-----+------+
|first|second|
+-----+------+
|    a|     1|
|    b|     2|
+-----+------+
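
As an aside (an editor's sketch, not part of the original issue): .show() prints to the driver's standard output, which in this setup only reaches the kernel's terminal log. One way to get a preview into the cell itself is to collect a few rows on the driver with takeAsList and make them the cell's result value:

// Editor's sketch: render a preview in the notebook cell instead of relying on show().
// Uses only the kotlin-spark-api calls shown above plus Dataset.takeAsList(n).
var preview: List<Pair<String, Int>> = emptyList()
withSpark {
    preview = dsOf("a" to 1, "b" to 2, "c" to 3, "d" to 4)
        .filter { it.second <= 2 }
        .takeAsList(20)   // collects at most 20 rows back to the driver
        .toList()
}
preview                   // the last expression of a cell is rendered as its output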

However, if I try to do the same task interactively in the notebook, I get an error when using filter with a lambda function:

val spark = SparkSession
        .builder()
        .master("local[2]")
        .appName("Simple Application").orCreate

val ds = spark.dsOf("a" to 1, "b" to 2, "c" to 3, "d" to 4)

ds.count() // ==> 4

ds.filter { it.second <= 2 }.show() // ==> 💣

The full error message is:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 3, 192.168.1.202, executor driver): java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance of org.apache.spark.rdd.MapPartitionsRDD
	at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2133)
	at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1305)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2251)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
	at scala.collection.immutable.List$SerializationProxy.readObject(List.scala:488)
	at sun.reflect.GeneratedMethodAccessor25.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2136)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
	at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
	at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:115)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:83)
	at org.apache.spark.scheduler.Task.run(Task.scala:127)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:444)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:447)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2023)
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:1972)
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:1971)
scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1971)
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:950)
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:950)
scala.Option.foreach(Option.scala:407)
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:950)
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2203)
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2152)
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2141)
org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:752)
org.apache.spark.SparkContext.runJob(SparkContext.scala:2093)
org.apache.spark.SparkContext.runJob(SparkContext.scala:2114)
org.apache.spark.SparkContext.runJob(SparkContext.scala:2133)
org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:467)
org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:420)
org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:47)
org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3625)
org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2695)
org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3616)
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
org.apache.spark.sql.Dataset.withAction(Dataset.scala:3614)
org.apache.spark.sql.Dataset.head(Dataset.scala:2695)
org.apache.spark.sql.Dataset.take(Dataset.scala:2902)
org.apache.spark.sql.Dataset.getRows(Dataset.scala:300)
org.apache.spark.sql.Dataset.showString(Dataset.scala:337)
org.apache.spark.sql.Dataset.show(Dataset.scala:824)
org.apache.spark.sql.Dataset.show(Dataset.scala:783)
org.apache.spark.sql.Dataset.show(Dataset.scala:792)
Line_22_jupyter.<init>(Line_22.jupyter.kts:1)
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
java.lang.reflect.Constructor.newInstance(Constructor.java:423)
kotlin.script.experimental.jvm.BasicJvmScriptEvaluator.evalWithConfigAndOtherScriptsResults(BasicJvmScriptEvaluator.kt:96)
kotlin.script.experimental.jvm.BasicJvmScriptEvaluator.invoke$suspendImpl(BasicJvmScriptEvaluator.kt:41)
kotlin.script.experimental.jvm.BasicJvmScriptEvaluator.invoke(BasicJvmScriptEvaluator.kt)
kotlin.script.experimental.jvm.BasicJvmReplEvaluator.eval(BasicJvmReplEvaluator.kt:51)
org.jetbrains.kotlin.jupyter.ReplForJupyterImpl$doEval$resultWithDiagnostics$1.invokeSuspend(repl.kt:525)
kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:56)
kotlinx.coroutines.EventLoopImplBase.processNextEvent(EventLoop.common.kt:274)
kotlinx.coroutines.BlockingCoroutine.joinBlocking(Builders.kt:84)
kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking(Builders.kt:59)
kotlinx.coroutines.BuildersKt.runBlocking(Unknown Source)
kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking$default(Builders.kt:38)
kotlinx.coroutines.BuildersKt.runBlocking$default(Unknown Source)
org.jetbrains.kotlin.jupyter.ReplForJupyterImpl.doEval(repl.kt:525)
org.jetbrains.kotlin.jupyter.ReplForJupyterImpl.eval(repl.kt:366)
org.jetbrains.kotlin.jupyter.ProtocolKt$shellMessagesHandler$res$1.invoke(protocol.kt:138)
org.jetbrains.kotlin.jupyter.ProtocolKt$shellMessagesHandler$res$1.invoke(protocol.kt)
org.jetbrains.kotlin.jupyter.ProtocolKt.evalWithIO(protocol.kt:351)
org.jetbrains.kotlin.jupyter.ProtocolKt.shellMessagesHandler(protocol.kt:137)
org.jetbrains.kotlin.jupyter.IkotlinKt.kernelServer(ikotlin.kt:107)
org.jetbrains.kotlin.jupyter.IkotlinKt.main(ikotlin.kt:69)
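
A plausible reading of the trace (editor's note, not from the original issue): each notebook cell is compiled into a synthetic class such as Line_22_jupyter, visible at the bottom of the stack trace. The lambda passed to filter lives in that class, travels to the task as a java.lang.invoke.SerializedLambda, and has to be re-linked against its defining class when the task is deserialized; if that class isn't visible to the deserializing class loader, the raw SerializedLambda is left in place and cannot be assigned to the scala.Function3 field, which matches the ClassCastException above. A quick way to see which class a cell-defined lambda belongs to:

// Editor's sketch: inspect where a lambda defined in a notebook cell actually lives.
val predicate = { p: Pair<String, Int> -> p.second <= 2 }
println(predicate.javaClass.name)         // something like Line_NN_jupyter$predicate$1
println(predicate.javaClass.classLoader)  // the REPL's class loader, not the plain application classpath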

I found a report of a seemingly very similar error from someone trying to use Spark in an Almond Scala Jupyter notebook.

I’m just beginning to use Kotlin and the JVM, so I’m not sure why the version run with withSpark works but the interactive version does not.

Thank you!

Todd

Here are the details of my Jupyter notebook environment:

  • macOS 10.15.6
  • Java 1.8.0_144
  • kotlin-jupyter-kernel 0.8.2.5
  • kotlin-spark-api 0.3.0
  • spark-sql_2.12 3.0.0
  • jupyterlab 2.2.2
  • KotlinVersion.CURRENT 1.4.20
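
For reference, this is roughly how those dependencies end up in a kotlin-jupyter notebook when no %use descriptor is involved; the artifact coordinates below are illustrative guesses rather than copied from the issue:

// Editor's sketch with hypothetical coordinates: pull kotlin-spark-api and Spark SQL
// into the notebook via kotlin-jupyter's dependency annotations.
@file:Repository("https://repo1.maven.org/maven2/")
@file:DependsOn("org.jetbrains.kotlinx.spark:kotlin-spark-api-3.0.0:0.3.0")
@file:DependsOn("org.apache.spark:spark-sql_2.12:3.0.0")

import org.jetbrains.kotlinx.spark.api.*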

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 7

Top GitHub Comments

2 reactions
ToddSmall commented, Aug 3, 2020

Hi @ileasile !

Thank you! Your suggestion fixes my problem. 😄

I wasn’t using the %use spark magic because I wanted to try kotlin-spark-api. Perhaps we should add an additional library descriptor to kotlin-jupyter that is based on kotlin-spark-api?

Todd

1 reaction
ileasile commented, Aug 3, 2020

@ToddSmall yes) I’ve already asked @asm0dey to provide one.
