
[BUG]: UDF Group Apply on Azure DataBricks causes NRE at ArrowColumnVector.getChild

See original GitHub issue

Describe the bug

The application works locally using spark-submit, but once deployed to Azure Databricks as a Set Jar job it throws java.lang.NullPointerException at org.apache.spark.sql.vectorized.ArrowColumnVector.getChild(ArrowColumnVector.java:132).

The code works fine when tested locally using spark-submit, so our guess is that the issue is related to dependencies on the workers.
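
If that guess is right, one way to narrow it down (a minimal diagnostic sketch, not code from the issue) is to run a plain UDF first: it forces Microsoft.Spark.Worker to start on the executors without going through the Arrow grouped-map path that Apply uses, so a worker deployment problem and an Arrow problem fail differently. It assumes `spark` is the SparkSession from the code below.

    using System;
    using Microsoft.Spark.Sql;
    using static Microsoft.Spark.Sql.Functions;

    // Hypothetical diagnostic: a plain UDF makes the .NET worker spin up on
    // the executors. If this also fails, the worker install is suspect; if it
    // succeeds and only Apply() fails, the problem is on the Arrow path.
    Func<Column, Column> workerDir = Udf<long, string>(
        _ => Environment.GetEnvironmentVariable("DOTNET_WORKER_DIR") ?? "unset");

    spark.Range(1).Select(workerDir(Col("id"))).Show();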

We’re pretty sure it’s not caused by null values in the DataFrame. The JDBC SQL connection also works and prints the schema as expected; it only crashes when dataframe.Show() triggers the UDF.

            DataFrame termsDF = spark.Read()
                .Jdbc(jdbcUrl, "dbo.CountryPopluations", connectionProperties);
            termsDF.PrintSchema();
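
To double-check the null hypothesis, something along these lines (a hypothetical sketch using the Microsoft.Spark column functions, not code from the issue) would count null cells per column before the UDF ever runs:

    using System;
    using static Microsoft.Spark.Sql.Functions;

    // Hypothetical null audit: count nulls per column so null-driven failures
    // can be ruled in or out before GroupBy().Apply() is involved.
    foreach (string name in termsDF.Columns())
    {
        long nulls = termsDF.Filter(Col(name).IsNull()).Count();
        Console.WriteLine($"{name}: {nulls} null(s)");
    }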

To Reproduce

Steps to reproduce the behavior:

  1. Deploy .NET Core 3.1 App using Set Jar instructions
  2. Start cluster and job. Code similar to:
            DataFrame birthRatesDF = countriesDF
                .Select("Id",
                        "PopulationCount",
                        "Year",
                        "CountryId")
                .GroupBy("CountryId")
                .Apply(
                    birthratesStructure,
                    r => CalcBirthRates(r,
                        "Id",
                        "PopulationCount",
                        "CountryId"
                    ));

            birthRatesDF.Show(); //Exception thrown here
            birthRatesDF.Write().Mode(SaveMode.Append).Jdbc(jdbcUrl, "dbo.BirthRates", connectionProperties);
            


#if DEBUG
            // Stop Spark session, but don't call this in prod on Databricks
            spark.Stop();
#endif
        }

        private static RecordBatch CalcBirthRates(RecordBatch salesRecords,
            string idColumnName,
            string populationCountName,
            string countryIdColumnName
            )
        {
            // Do simple math calculations; the builders countryIds,
            // popuLationGrowths, and years are populated here (elided in the
            // issue; see the fleshed-out sketch after the error log below).

            return new RecordBatch(
                new Schema.Builder()
                    .Field(f => f.Name("countryId").DataType(Arrow.Int32Type.Default))
                    .Field(f => f.Name("popuLationGrowth").DataType(Arrow.StringType.Default))
                    .Field(f => f.Name("year").DataType(Arrow.StringType.Default))
                    .Build(),
                new IArrowArray[]
                {
                    countryIds.Build(),
                    popuLationGrowths.Build(),
                    years.Build()
                },
                recordCount);
        }
  3. See error:
[Times: user=0.59 sys=0.03, real=0.16 secs] 
 [Full GC (Metadata GC Threshold) [PSYoungGen: 253426K->0K(1552384K)] [ParOldGen: 471161K->505759K(4273664K)] 724588K->505759K(5826048K), [Metaspace: 160795K->159628K(1189888K)], 0.8390849 secs] [Times: user=2.80 sys=0.01, real=0.84 secs] 
[...] [...] [Error] [JvmBridge] JVM method execution failed: Nonstatic method showString failed for class 26 when called with 3 arguments ([Index=1, Type=Int32, Value=20], [Index=2, Type=Int32, Value=20], [Index=3, Type=Boolean, Value=False], )
[..] [...] [Error] [JvmBridge] org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, 10.139.64.5, executor 1): java.lang.NullPointerException
	at org.apache.spark.sql.vectorized.ArrowColumnVector.getChild(ArrowColumnVector.java:132)
	at org.apache.spark.sql.execution.python.FlatMapGroupsInPandasExec$$anonfun$doExecute$2$$anonfun$apply$2$$anonfun$4.apply(FlatMapGroupsInPandasExec.scala:155)
	at org.apache.spark.sql.execution.python.FlatMapGroupsInPandasExec$$anonfun$doExecute$2$$anonfun$apply$2$$anonfun$4.apply(FlatMapGroupsInPandasExec.scala:155)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.immutable.Range.foreach(Range.scala:160)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.AbstractTraversable.map(Traversable.scala:104)
	at org.apache.spark.sql.execution.python.FlatMapGroupsInPandasExec$$anonfun$doExecute$2$$anonfun$apply$2.apply(FlatMapGroupsInPandasExec.scala:155)
	at org.apache.spark.sql.execution.python.FlatMapGroupsInPandasExec$$anonfun$doExecute$2$$anonfun$apply$2.apply(FlatMapGroupsInPandasExec.scala:152)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:640)
	at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:62)
	at org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:159)
	at org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:158)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.doRunTask(Task.scala:140)
	at org.apache.spark.scheduler.Task.run(Task.scala:113)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:537)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1541)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:543)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
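
Since the issue elides both birthratesStructure and the body of CalcBirthRates, here is a hypothetical reconstruction of what a matching pair could look like (a sketch using the Microsoft.Spark StructType API and the Apache.Arrow builder API; the actual growth calculation is unknown, so the math below is a placeholder):

    using Apache.Arrow;
    using Microsoft.Spark.Sql.Types;
    using Arrow = Apache.Arrow.Types;

    // Hypothetical return schema for Apply(): it must line up field-for-field
    // with the Arrow schema built inside CalcBirthRates.
    private static readonly StructType birthratesStructure = new StructType(new[]
    {
        new StructField("countryId", new IntegerType()),
        new StructField("popuLationGrowth", new StringType()),
        new StructField("year", new StringType())
    });

    private static RecordBatch CalcBirthRates(RecordBatch salesRecords,
        string idColumnName,
        string populationCountName,
        string countryIdColumnName)
    {
        // Pull the input columns out of the Arrow batch by name.
        var countryIdColumn = (Int32Array)salesRecords.Column(
            salesRecords.Schema.GetFieldIndex(countryIdColumnName));
        var populationColumn = (Int32Array)salesRecords.Column(
            salesRecords.Schema.GetFieldIndex(populationCountName));

        var countryIds = new Int32Array.Builder();
        var popuLationGrowths = new StringArray.Builder();
        var years = new StringArray.Builder();

        int recordCount = salesRecords.Length;
        for (int i = 0; i < recordCount; i++)
        {
            countryIds.Append(countryIdColumn.GetValue(i) ?? 0);
            // Placeholder math: the real calculation is elided in the issue.
            popuLationGrowths.Append(
                (populationColumn.GetValue(i) ?? 0).ToString());
            years.Append("n/a");
        }

        return new RecordBatch(
            new Schema.Builder()
                .Field(f => f.Name("countryId").DataType(Arrow.Int32Type.Default))
                .Field(f => f.Name("popuLationGrowth").DataType(Arrow.StringType.Default))
                .Field(f => f.Name("year").DataType(Arrow.StringType.Default))
                .Build(),
            new IArrowArray[]
            {
                countryIds.Build(),
                popuLationGrowths.Build(),
                years.Build()
            },
            recordCount);
    }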

Expected behavior

We expect the program to run successfully on Databricks, just as it does locally using spark-submit. The database connection is the same for both environments, and as mentioned above we can confirm it connects.
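
Since the code and connection are constant and only the cluster changes, one more data point worth capturing (a hypothetical diagnostic, not from the issue; it assumes RuntimeConfig.Get with a default value) is the Arrow-related Spark configuration in each environment, because the failing grouped-map path runs over Arrow:

    using System;

    // Hypothetical diagnostic: dump Arrow-related settings so the local and
    // Databricks runs can be diffed; Apply() serializes groups via Arrow.
    foreach (string key in new[]
    {
        "spark.sql.execution.arrow.enabled",
        "spark.sql.execution.arrow.maxRecordsPerBatch"
    })
    {
        Console.WriteLine($"{key} = {spark.Conf().Get(key, "unset")}");
    }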

Environment:

  • .NET Core 3.1
  • Azure Databricks Cluster: 6.4 (includes Apache Spark 2.4.5, Scala 2.11)
  • Microsoft.Spark.Worker.netcoreapp3.1.linux-x64-0.10.0
  • microsoft-spark-2.4.x-0.10.0.jar

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 9 (2 by maintainers)

Top GitHub Comments

2 reactions
jammman commented, Apr 2, 2020

I tried an Azure HDInsight Spark cluster to compare, and I can confirm this works correctly on HDInsight with no code changes, as expected (and hoped). So it does seem to be a bug specific to Azure Databricks.

1 reaction
jammman commented, Apr 2, 2020

Thanks @elvaliuliuliu and @imback82, so far I have only tried:

  • Local, spark-2.4.1-bin-hadoop2.7, successful
  • Local, spark-2.4.5-bin-hadoop2.7, successful
  • Databricks 6.4 (includes Apache Spark 2.4.5, Scala 2.11), failed
  • HDInsight 2.4, successful

Hope that helps!

Read more comments on GitHub >

Top Results From Across the Web

Mastering Spark SQL PDF
Spark SQL is a new module in Apache Spark that integrates relational processing with Spark's functional programming API.
Read more >
