[BUG]: UDF Group Apply on Azure DataBricks causes NRE at ArrowColumnVector.getChild
**Describe the bug**
The application works locally using spark-submit, but once deployed to Azure Databricks as a Set Jar job it throws a java.lang.NullPointerException at org.apache.spark.sql.vectorized.ArrowColumnVector.getChild(ArrowColumnVector.java:132).
The code works fine when tested locally with spark-submit, so we suspect the issue is related to dependencies on the workers. We are fairly sure it is not caused by null values in the DataFrame: the JDBC connection works and prints the schema as expected, and the job only crashes when DataFrame.Show() invokes the UDF.
DataFrame termsDF = spark.Read()
    .Jdbc(jdbcUrl, "dbo.CountryPopluations", connectionProperties);
termsDF.PrintSchema();
**To Reproduce**
Steps to reproduce the behavior:
- Deploy the .NET Core 3.1 app using the Set Jar instructions
- Start the cluster and job, with code similar to:
DataFrame birthRatesDF = countriesDF
    .Select("Id", "PopulationCount", "Year", "CountryId")
    .GroupBy("CountryId")
    .Apply(
        birthratesStructure,
        r => CalcBirthRates(r, "Id", "PopulationCount", "CountryId"));

birthRatesDF.Show(); // Exception thrown here
birthRatesDF.Write().Mode(SaveMode.Append).Jdbc(jdbcUrl, "dbo.BirthRates", connectionProperties);

#if DEBUG
// Stop the Spark session; don't call this in prod on Databricks
spark.Stop();
#endif
}
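For context, birthratesStructure is not shown in the report; it is the return schema passed to Apply for the grouped-map UDF. Assuming it mirrors the three output columns built in CalcBirthRates below, a hypothetical sketch using Microsoft.Spark.Sql.Types might look like:

```csharp
using Microsoft.Spark.Sql.Types;

// Hypothetical return schema matching the RecordBatch built in CalcBirthRates.
var birthratesStructure = new StructType(new[]
{
    new StructField("countryId", new IntegerType()),
    new StructField("popuLationGrowth", new StringType()),
    new StructField("year", new StringType())
});
```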
private static RecordBatch CalcBirthRates(RecordBatch salesRecords,
    string idColumnName,
    string populationCountName,
    string countryIdColumnName)
{
    // Do simple math calculations (elided here); countryIds, popuLationGrowths,
    // years, and recordCount are derived from salesRecords.
    return new RecordBatch(
        new Schema.Builder()
            .Field(f => f.Name("countryId").DataType(Arrow.Int32Type.Default))
            .Field(f => f.Name("popuLationGrowth").DataType(Arrow.StringType.Default))
            .Field(f => f.Name("year").DataType(Arrow.StringType.Default))
            .Build(),
        new IArrowArray[]
        {
            countryIds.Build(),
            popuLationGrowths.Build(),
            years.Build()
        },
        recordCount);
}
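The builders the snippet relies on (countryIds, popuLationGrowths, years) and recordCount are omitted from the report; the actual calculation is not shown. A hypothetical sketch of how such builders could be populated with the Apache.Arrow C# API:

```csharp
using Apache.Arrow;

// Hypothetical placeholder values; the real ones come from salesRecords.
var countryIds = new Int32Array.Builder().Append(1);
var popuLationGrowths = new StringArray.Builder().Append("0.8");
var years = new StringArray.Builder().Append("2020");
int recordCount = 1;
```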
- See error:
[Times: user=0.59 sys=0.03, real=0.16 secs]
[Full GC (Metadata GC Threshold) [PSYoungGen: 253426K->0K(1552384K)] [ParOldGen: 471161K->505759K(4273664K)] 724588K->505759K(5826048K), [Metaspace: 160795K->159628K(1189888K)], 0.8390849 secs] [Times: user=2.80 sys=0.01, real=0.84 secs]
[...] [...] [Error] [JvmBridge] JVM method execution failed: Nonstatic method showString failed for class 26 when called with 3 arguments ([Index=1, Type=Int32, Value=20], [Index=2, Type=Int32, Value=20], [Index=3, Type=Boolean, Value=False], )
[..] [...] [Error] [JvmBridge] org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, 10.139.64.5, executor 1): java.lang.NullPointerException
at org.apache.spark.sql.vectorized.ArrowColumnVector.getChild(ArrowColumnVector.java:132)
at org.apache.spark.sql.execution.python.FlatMapGroupsInPandasExec$$anonfun$doExecute$2$$anonfun$apply$2$$anonfun$4.apply(FlatMapGroupsInPandasExec.scala:155)
at org.apache.spark.sql.execution.python.FlatMapGroupsInPandasExec$$anonfun$doExecute$2$$anonfun$apply$2$$anonfun$4.apply(FlatMapGroupsInPandasExec.scala:155)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.Range.foreach(Range.scala:160)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at org.apache.spark.sql.execution.python.FlatMapGroupsInPandasExec$$anonfun$doExecute$2$$anonfun$apply$2.apply(FlatMapGroupsInPandasExec.scala:155)
at org.apache.spark.sql.execution.python.FlatMapGroupsInPandasExec$$anonfun$doExecute$2$$anonfun$apply$2.apply(FlatMapGroupsInPandasExec.scala:152)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:640)
at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:62)
at org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:159)
at org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:158)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:140)
at org.apache.spark.scheduler.Task.run(Task.scala:113)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:537)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1541)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:543)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
**Expected behavior**
We expect the program to run successfully, as it does locally with spark-submit. The database connection is the same in both environments and, as noted above, we can confirm it connects.
**Environment:**
- .NET Core 3.1
- Azure Databricks Cluster: 6.4 (includes Apache Spark 2.4.5, Scala 2.11)
- Microsoft.Spark.Worker.netcoreapp3.1.linux-x64-0.10.0
- microsoft-spark-2.4.x-0.10.0.jar
Issue Analytics
- Created 3 years ago
- Comments: 9 (2 by maintainers)
Top GitHub Comments
I tried an Azure HDInsight Spark cluster to compare, and I can confirm this works correctly on HDInsight with no code changes, as expected (and hoped). So it does seem to be a bug specific to Azure Databricks.
Thanks @elvaliuliuliu and @imback82; so far we have only tried:
Hope that helps!