[BUG]: UDF Group Apply on Azure DataBricks causes NRE at ArrowColumnVector.getChild
**Describe the bug**
The application works locally using spark-submit, but once deployed to Azure Databricks as a Set Jar job it throws a java.lang.NullPointerException at org.apache.spark.sql.vectorized.ArrowColumnVector.getChild(ArrowColumnVector.java:132).
The code works fine when tested locally with spark-submit, so we suspect the issue is related to dependencies on the workers. We are fairly sure it is not caused by null values in the DataFrame: the JDBC connection works and prints the schema as expected, and the job only crashes when DataFrame.Show() invokes the UDF.
DataFrame termsDF = spark.Read()
    .Jdbc(jdbcUrl, "dbo.CountryPopluations", connectionProperties);
termsDF.PrintSchema();
**To Reproduce**
Steps to reproduce the behavior:
- Deploy the .NET Core 3.1 app using the Set Jar instructions
- Start the cluster and job, with code similar to:
DataFrame birthRatesDF = countriesDF
    .Select("Id", "PopulationCount", "Year", "CountryId")
    .GroupBy("CountryId")
    .Apply(
        birthratesStructure,
        r => CalcBirthRates(r, "Id", "PopulationCount", "CountryId"));

birthRatesDF.Show(); // Exception thrown here
birthRatesDF.Write().Mode(SaveMode.Append).Jdbc(jdbcUrl, "dbo.BirthRates", connectionProperties);

#if DEBUG
// Stop the Spark session; don't call this in prod on Databricks
spark.Stop();
#endif
}
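For context, birthratesStructure is not shown in the report; it is the return schema passed to Apply for the grouped-map UDF. Assuming it mirrors the three output columns built in CalcBirthRates below, a hypothetical sketch using Microsoft.Spark.Sql.Types might look like:

```csharp
using Microsoft.Spark.Sql.Types;

// Hypothetical return schema matching the RecordBatch built in CalcBirthRates.
var birthratesStructure = new StructType(new[]
{
    new StructField("countryId", new IntegerType()),
    new StructField("popuLationGrowth", new StringType()),
    new StructField("year", new StringType())
});
```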
private static RecordBatch CalcBirthRates(RecordBatch salesRecords,
    string idColumnName,
    string populationCountName,
    string countryIdColumnName)
{
    // Do simple math calculations (elided here); countryIds, popuLationGrowths,
    // years, and recordCount are derived from salesRecords.
    return new RecordBatch(
        new Schema.Builder()
            .Field(f => f.Name("countryId").DataType(Arrow.Int32Type.Default))
            .Field(f => f.Name("popuLationGrowth").DataType(Arrow.StringType.Default))
            .Field(f => f.Name("year").DataType(Arrow.StringType.Default))
            .Build(),
        new IArrowArray[]
        {
            countryIds.Build(),
            popuLationGrowths.Build(),
            years.Build()
        },
        recordCount);
}
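The builders the snippet relies on (countryIds, popuLationGrowths, years) and recordCount are omitted from the report; the actual calculation is not shown. A hypothetical sketch of how such builders could be populated with the Apache.Arrow C# API:

```csharp
using Apache.Arrow;

// Hypothetical placeholder values; the real ones come from salesRecords.
var countryIds = new Int32Array.Builder().Append(1);
var popuLationGrowths = new StringArray.Builder().Append("0.8");
var years = new StringArray.Builder().Append("2020");
int recordCount = 1;
```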
- See error:
[Times: user=0.59 sys=0.03, real=0.16 secs]
[Full GC (Metadata GC Threshold) [PSYoungGen: 253426K->0K(1552384K)] [ParOldGen: 471161K->505759K(4273664K)] 724588K->505759K(5826048K), [Metaspace: 160795K->159628K(1189888K)], 0.8390849 secs] [Times: user=2.80 sys=0.01, real=0.84 secs]
[...] [...] [Error] [JvmBridge] JVM method execution failed: Nonstatic method showString failed for class 26 when called with 3 arguments ([Index=1, Type=Int32, Value=20], [Index=2, Type=Int32, Value=20], [Index=3, Type=Boolean, Value=False], )
[..] [...] [Error] [JvmBridge] org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, 10.139.64.5, executor 1): java.lang.NullPointerException
at org.apache.spark.sql.vectorized.ArrowColumnVector.getChild(ArrowColumnVector.java:132)
at org.apache.spark.sql.execution.python.FlatMapGroupsInPandasExec$$anonfun$doExecute$2$$anonfun$apply$2$$anonfun$4.apply(FlatMapGroupsInPandasExec.scala:155)
at org.apache.spark.sql.execution.python.FlatMapGroupsInPandasExec$$anonfun$doExecute$2$$anonfun$apply$2$$anonfun$4.apply(FlatMapGroupsInPandasExec.scala:155)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.Range.foreach(Range.scala:160)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at org.apache.spark.sql.execution.python.FlatMapGroupsInPandasExec$$anonfun$doExecute$2$$anonfun$apply$2.apply(FlatMapGroupsInPandasExec.scala:155)
at org.apache.spark.sql.execution.python.FlatMapGroupsInPandasExec$$anonfun$doExecute$2$$anonfun$apply$2.apply(FlatMapGroupsInPandasExec.scala:152)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:640)
at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:62)
at org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:159)
at org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:158)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:140)
at org.apache.spark.scheduler.Task.run(Task.scala:113)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:537)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1541)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:543)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
**Expected behavior**
We expect the program to run successfully, as it does locally with spark-submit. The database connection is the same in both environments and, as noted above, we can confirm it connects.
**Environment:**
- .NET Core 3.1
- Azure Databricks Cluster: 6.4 (includes Apache Spark 2.4.5, Scala 2.11)
- Microsoft.Spark.Worker.netcoreapp3.1.linux-x64-0.10.0
- microsoft-spark-2.4.x-0.10.0.jar
Issue Analytics
- Created 3 years ago
- Comments: 9 (2 by maintainers)
Top GitHub Comments
I tried an Azure HDInsight Spark cluster to compare, and I can confirm this works correctly on HDInsight with no code changes, as expected (and hoped). So it does seem to be a bug specific to Azure Databricks.
Thanks @elvaliuliuliu and @imback82; so far we have only tried:
Hope that helps!