[BUG]: Broadcast throws exception on Databricks on AWS

Describe the bug
Broadcast throws an exception on Databricks on AWS.
Error Logs

```
20/10/26 16:43:55 WARN DotnetBackendHandler: cannot find matching method class org.apache.spark.api.python.PythonRDD.setupBroadcast. Candidates are:
20/10/26 16:43:55 WARN DotnetBackendHandler: setupBroadcast(class org.apache.spark.api.java.JavaSparkContext,class java.lang.String)
20/10/26 16:43:55 ERROR DotnetBackendHandler: Failed to execute 'setupBroadcast' on 'NullObject' with args=([Type=java.lang.String, Value: /local_disk0/spark-0b4511c2-19cd-4849-9fe6-583b1ea94fac/sparkdotnet/5w3ohffw.2m0])
```
HelloSpark Program.cs Code

```csharp
using Microsoft.Spark;
using Microsoft.Spark.Sql;
using Microsoft.Spark.Sql.Types;

namespace HelloSpark
{
    class Program
    {
        static void Main(string[] args)
        {
            var spark = SparkSession.Builder().GetOrCreate();
            var broadcastTest = spark.SparkContext.Broadcast("test");
        }
    }
}
```
To Reproduce
- Create a simple HelloSpark program that just gets the Spark context and performs a broadcast.
- Set up a jar job using the HelloSpark program in Databricks per the deployment guide: https://github.com/dotnet/spark/tree/master/deployment#using-set-jar
- Run the job.
- The Standard Output logs should contain an error complaining that setupBroadcast only exists with two parameters (org.apache.spark.api.java.JavaSparkContext, java.lang.String) but was called with one (java.lang.String).
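For context, the "Set JAR" job from the deployment guide boils down to Databricks job settings along these lines (the app zip path and executable name below are illustrative, not from this issue; the runner class is the one the guide specifies):

```
Main class: org.apache.spark.deploy.dotnet.DotnetRunner
Parameters: ["/dbfs/apps/HelloSpark.zip", "HelloSpark"]
```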
Expected behavior
Broadcast works without any errors.
Environment Info
- AWS
- Databricks Runtime: 6.6 (Scala 2.11, Spark 2.4.5)
- Using jar: microsoft_spark_2_4_2_11_1_0_0.jar
- .NET for Apache Spark v1.0.0
Additional Info
For whatever reason, in the Databricks environment PythonRDD.setupBroadcast takes both the JavaSparkContext and the path. https://github.com/dotnet/spark/blob/master/src/csharp/Microsoft.Spark/Broadcast.cs#L177
Updating Broadcast.cs to also pass in the JavaSparkContext resolved the issue when running on Databricks, but of course breaks it in other environments like AWS EMR.
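One defensive pattern that covers both environments would be to try the open-source one-argument signature first and fall back to the Databricks two-argument one. The sketch below is illustrative only, not the library's actual fix: it assumes access to Microsoft.Spark's internal JVM bridge (which the real Broadcast.cs uses via SparkEnvironment.JvmBridge), and the method/type names mirror that internal API rather than anything public.

```csharp
// Illustrative sketch: probe which PythonRDD.setupBroadcast signature the
// JVM exposes. Assumes internal Microsoft.Spark types (IJvmBridge,
// JvmObjectReference); cannot run outside a live Spark driver.
private static JvmObjectReference SetupBroadcast(
    IJvmBridge jvm, JvmObjectReference javaSparkContext, string path)
{
    try
    {
        // Open-source Apache Spark signature: setupBroadcast(path)
        return (JvmObjectReference)jvm.CallStaticJavaMethod(
            "org.apache.spark.api.python.PythonRDD", "setupBroadcast", path);
    }
    catch (Exception)
    {
        // Databricks Runtime signature: setupBroadcast(javaSparkContext, path)
        return (JvmObjectReference)jvm.CallStaticJavaMethod(
            "org.apache.spark.api.python.PythonRDD", "setupBroadcast",
            javaSparkContext, path);
    }
}
```

Catching the failure of the first call and retrying with the second signature avoids hard-coding an environment check, at the cost of masking unrelated errors from the first call.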
This example uses Spark 2.4, but I have seen the same issue on Spark 3.
This may also happen on Azure Databricks, but I can only confirm it on AWS.
Issue Analytics
- Created 3 years ago
- Comments: 9
Top GitHub Comments
- Hi @jonathantonic, I was able to reproduce this on AWS as well as Azure Databricks for Spark versions 2.4.5 and 3.0.1. We are working on a solution/workaround and will update you as soon as we have a plan. Thanks!
- Thanks @jonathantonic, I'm going to repro this bug and get back to you with our suggestions.