[BUG]: Broadcast throws exception on Databricks on AWS.

Describe the bug

Broadcast throws an exception on Databricks on AWS.

Error Logs

20/10/26 16:43:55 WARN DotnetBackendHandler: cannot find matching method class org.apache.spark.api.python.PythonRDD.setupBroadcast. Candidates are:
20/10/26 16:43:55 WARN DotnetBackendHandler: setupBroadcast(class org.apache.spark.api.java.JavaSparkContext,class java.lang.String)
20/10/26 16:43:55 ERROR DotnetBackendHandler: Failed to execute 'setupBroadcast' on 'NullObject' with args=([Type=java.lang.String, Value: /local_disk0/spark-0b4511c2-19cd-4849-9fe6-583b1ea94fac/sparkdotnet/5w3ohffw.2m0])

HelloSpark Program.cs Code

using Microsoft.Spark;
using Microsoft.Spark.Sql;
using Microsoft.Spark.Sql.Types;

namespace HelloSpark
{
    class Program
    {
        static void Main(string[] args)
        {
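            // Get (or create) the Spark session for this job.
            var spark = SparkSession.Builder().GetOrCreate();
            // Broadcasting any value is enough to trigger the setupBroadcast error on Databricks.
            var broadcastTest = spark.SparkContext.Broadcast("test");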
        }
    }
}

To Reproduce

  1. Create a simple HelloSpark program that just gets the context and does a broadcast.
  2. Set up a jar job using the HelloSpark program in Databricks as per the deployment guide: https://github.com/dotnet/spark/tree/master/deployment#using-set-jar
  3. Run the job.
  4. The Standard Output logs should contain an error complaining that setupBroadcast only exists with two parameters (org.apache.spark.api.java.JavaSparkContext, java.lang.String) but is being called with one (java.lang.String).

Expected behavior

Broadcast works without any errors.

Environment Info

  • AWS
  • Databricks Runtime: 6.6 (Scala 2.11, Spark 2.4.5)
  • using jar: microsoft_spark_2_4_2_11_1_0_0.jar
  • Dotnet Spark v1.0.0

Additional Info

For some reason, in the Databricks environment PythonRDD.setupBroadcast takes the JavaSparkContext as well as the path, whereas Broadcast.cs passes only the path: https://github.com/dotnet/spark/blob/master/src/csharp/Microsoft.Spark/Broadcast.cs#L177

Updating Broadcast.cs to also pass in the JavaSparkContext resolved the issue when running on Databricks, but of course breaks it in other environments such as AWS EMR.

This example uses Spark 2.4, but I have seen the same issue on Spark 3.

This may also happen on Azure, but I can only confirm it on AWS.
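
The mismatch between the two setupBroadcast signatures can be illustrated with a small JVM-side sketch. This is only a minimal illustration, assuming a caller that can inspect the available overloads through reflection and pass the context only when an overload asks for it. FakeContext, OpenSourceRdd, DatabricksRdd and callSetupBroadcast are hypothetical stand-ins invented for this example; they are not Spark, Databricks or .NET for Apache Spark APIs, and this is not necessarily the fix the maintainers chose.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

public class SetupBroadcastShim {

    // Hypothetical stand-in for JavaSparkContext (not the real Spark class).
    static final class FakeContext {}

    // Hypothetical stand-in for the one-argument signature: setupBroadcast(String).
    static final class OpenSourceRdd {
        public static String setupBroadcast(String path) {
            return "oss:" + path;
        }
    }

    // Hypothetical stand-in for the two-argument signature seen on Databricks:
    // setupBroadcast(JavaSparkContext, String).
    static final class DatabricksRdd {
        public static String setupBroadcast(FakeContext sc, String path) {
            return "databricks:" + path;
        }
    }

    // Looks up the static setupBroadcast overloads on the target class and
    // passes the context only when the available overload expects it.
    static Object callSetupBroadcast(Class<?> target, FakeContext sc, String path)
            throws Exception {
        Method oneArg = null;
        Method twoArg = null;
        for (Method m : target.getDeclaredMethods()) {
            if (!m.getName().equals("setupBroadcast")
                    || !Modifier.isStatic(m.getModifiers())) {
                continue;
            }
            Class<?>[] p = m.getParameterTypes();
            if (p.length == 1 && p[0] == String.class) {
                oneArg = m;
            }
            if (p.length == 2 && p[0].isInstance(sc) && p[1] == String.class) {
                twoArg = m;
            }
        }
        if (oneArg != null) {
            return oneArg.invoke(null, path);
        }
        if (twoArg != null) {
            return twoArg.invoke(null, sc, path);
        }
        throw new NoSuchMethodException(
                "no matching setupBroadcast overload on " + target.getName());
    }

    public static void main(String[] args) throws Exception {
        FakeContext sc = new FakeContext();
        // prints "oss:/tmp/broadcast-file"
        System.out.println(callSetupBroadcast(OpenSourceRdd.class, sc, "/tmp/broadcast-file"));
        // prints "databricks:/tmp/broadcast-file"
        System.out.println(callSetupBroadcast(DatabricksRdd.class, sc, "/tmp/broadcast-file"));
    }
}
```

Choosing the overload by inspecting the target class, rather than hard-coding one argument list, is what lets the same caller work against both signatures. A caller that always sends a single String fails against the two-argument variant, which matches the "cannot find matching method" warning in the logs above.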

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 9

Top GitHub Comments

1 reaction
Niharikadutta commented, Nov 4, 2020

Hi @jonathantonic I was able to reproduce this on AWS as well as Azure Databricks for Spark versions 2.4.5 and 3.0.1. We are working on coming up with a solution/workaround and will update you as soon as we have a plan. Thanks!

1 reaction
Niharikadutta commented, Oct 26, 2020

Thanks @jonathantonic I’m going to repro this bug and get back to you with our suggestions.
