[BUG]: Broadcast throws exception on Databricks on AWS.

Describe the bug

Broadcast throws exception on Databricks on AWS.

Error Logs

20/10/26 16:43:55 WARN DotnetBackendHandler: cannot find matching method class org.apache.spark.api.python.PythonRDD.setupBroadcast. Candidates are:
20/10/26 16:43:55 WARN DotnetBackendHandler: setupBroadcast(class org.apache.spark.api.java.JavaSparkContext,class java.lang.String)
20/10/26 16:43:55 ERROR DotnetBackendHandler: Failed to execute 'setupBroadcast' on 'NullObject' with args=([Type=java.lang.String, Value: /local_disk0/spark-0b4511c2-19cd-4849-9fe6-583b1ea94fac/sparkdotnet/5w3ohffw.2m0])

HelloSpark Program.cs Code

using Microsoft.Spark;
using Microsoft.Spark.Sql;
using Microsoft.Spark.Sql.Types;

namespace HelloSpark
{
    class Program
    {
        static void Main(string[] args)
        {
            var spark = SparkSession.Builder().GetOrCreate();
            var broadcastTest = spark.SparkContext.Broadcast("test");
        }
    }
}

To Reproduce

Create a simple HelloSpark program that just gets the context and does a broadcast.
Set up a JAR job for the HelloSpark program in Databricks, following the deployment guide: https://github.com/dotnet/spark/tree/master/deployment#using-set-jar
Run the job.
The Standard Output logs should show an error reporting that setupBroadcast exists only with two parameters (org.apache.spark.api.java.JavaSparkContext, java.lang.String) but was called with one (java.lang.String).

Expected behavior

Broadcast works without any errors.

Environment Info

AWS
Databricks Runtime: 6.6 (Scala 2.11, Spark 2.4.5)
using jar: microsoft_spark_2_4_2_11_1_0_0.jar
Dotnet Spark v1.0.0

Additional Info

For some reason, in the Databricks environment PythonRDD.setupBroadcast appears to take both the JavaSparkContext and the path, whereas Broadcast.cs passes only the path: https://github.com/dotnet/spark/blob/master/src/csharp/Microsoft.Spark/Broadcast.cs#L177

Updating Broadcast.cs to also pass in the JavaSparkContext resolved the issue on Databricks, but of course breaks it in other environments such as AWS EMR.
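One way to avoid choosing between the two environments might be to try the one-argument call first and retry with the context prepended if it fails. The sketch below only illustrates that idea and is not the fix adopted by the project. `StaticJvmCall` is a hypothetical stand-in for whatever invokes a static JVM method by name; it is not part of the Microsoft.Spark public API, and the catch clause is deliberately broad because the exact exception type raised on a signature mismatch is an assumption here.

```csharp
using System;

namespace BroadcastFallbackSketch
{
    // Hypothetical stand-in for a bridge that calls a static JVM method by name.
    // This delegate is an illustration only, not a Microsoft.Spark type.
    public delegate object StaticJvmCall(
        string className, string methodName, params object[] args);

    public static class BroadcastSetup
    {
        private const string PythonRddClass =
            "org.apache.spark.api.python.PythonRDD";

        // Tries setupBroadcast(path) first, the shape Broadcast.cs uses today.
        // If that call fails, retries with setupBroadcast(context, path),
        // the shape the Databricks log lists as the only candidate.
        public static object SetupBroadcast(
            StaticJvmCall call, object javaSparkContext, string path)
        {
            try
            {
                return call(PythonRddClass, "setupBroadcast", path);
            }
            catch (Exception)
            {
                // Assumption: any failure here means the signature did not match.
                // A real change would narrow this to the specific error.
                return call(PythonRddClass, "setupBroadcast", javaSparkContext, path);
            }
        }
    }
}
```

With this shape, an environment that only accepts the one-argument form never reaches the second call, while a Databricks-like environment pays for one failed attempt (and the WARN/ERROR lines it logs) before succeeding.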

This example uses Spark 2.4, but I have seen the same issue on Spark 3.

This may also happen on Azure, but I can only confirm it on AWS.

Top GitHub Comments

Niharikadutta commented, Nov 4, 2020

Hi @jonathantonic I was able to reproduce this on AWS as well as Azure Databricks for Spark versions 2.4.5 and 3.0.1. We are working on coming up with a solution/workaround and will update you as soon as we have a plan. Thanks!

Niharikadutta commented, Oct 26, 2020

Thanks @jonathantonic I’m going to repro this bug and get back to you with our suggestions.