
Very slow to write to Azure SQL

See original GitHub issue

Due to a recent company merger we run Spark on AWS and need to write to Azure SQL. We regularly write to MySQL and Postgresql and it’s very fast, but we’re finding that writing to SQL Server with the jdbc driver is unusably slow. I thought I’d try out this library due to the claimed speed, but I’m just not seeing any noticeable improvement.

My dataframe is about 28M rows, and I killed this after an hour when it had only written about 11M rows.

  val properties = new java.util.Properties()
  properties.put("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")

  df.write.
    mode(SaveMode.Overwrite).
    option("truncate", true).
    option("schemaCheckEnabled", false).
    option("batchSize", "1048576").
    option("tableLock", "false").
    jdbc(dbConnectionMap("url"), "spark_test", properties)

I’ve tried with numerous settings and have never seen an acceptable result. It’s possible this connector is faster than the default and I’m just not seeing the difference because I give up and kill them both, but it doesn’t really matter if it’s still too slow to use.

For reference, the solution we came up with is to repartition the dataframe to a single partition, write it to S3 as a single CSV file, call a lambda function to transfer the CSV file from S3 to Azure Blob Storage, then do a bulk insert on that CSV file. As convoluted as this solution is, it’s dramatically faster than writing to SQL Server directly.

  Writing csv: Wed Jan 12 17:07:57 UTC 2022
  Done writing csv: Wed Jan 12 17:09:06 UTC 2022
  Calling lambda and waiting for transfer to azure: Wed Jan 12 17:09:08 UTC 2022
  File exists on azure: Wed Jan 12 17:10:32 UTC 2022
  Executing bulk insert: Wed Jan 12 17:11:00 UTC 2022
  Complete bulk insert after: 4:17

That’s about 8 minutes total for 28M rows, versus over an hour to write less than half that many rows directly.
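For completeness, the workaround above can be sketched roughly as follows. The bucket path, the lambda invocation, and the external data source name (`MyBlobStorage`) are all placeholders, not the actual values from our setup:

  // Write the dataframe as a single CSV object to S3 (placeholder path).
  df.repartition(1)
    .write
    .mode(SaveMode.Overwrite)
    .option("header", "false")
    .csv("s3://my-bucket/exports/spark_test/")

  // A lambda (not shown) then copies the object to Azure Blob Storage,
  // after which a bulk insert is issued against the target table, e.g.:
  //   BULK INSERT spark_test
  //   FROM 'exports/spark_test/part-00000.csv'
  //   WITH (DATA_SOURCE = 'MyBlobStorage',
  //         FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');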

Is there a solution to this? I’m not seeing anywhere near the speeds you’re claiming in your docs. This doesn’t seem to be a limit imposed by the database server, since I can bulk insert the same data in 4 minutes from a CSV file.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 15 (5 by maintainers)

Top GitHub Comments

2 reactions
timgautier commented, Mar 24, 2022
  df.write
    .format("com.microsoft.sqlserver.jdbc.spark")
    .mode("overwrite")
    .option("url", url)
    .option("dbtable", "spark_test")
    .option("user", dbUser)
    .option("password", dbPassword)
    .option("database", dbName)
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .save()

This is working. Issuing BULK INSERT statements now.

I see what happened. When calling .jdbc, you supply a java.util.Properties object, which typically contains the driver class name. If you call .save without also giving the driver class name in the options, it errors out. This is why I used .jdbc instead of .save: it’s what I use for other databases, and bulk inserts into them work fine, but it doesn’t behave the same with SQL Server.

On the front page is this example:

df.write \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .mode("append") \
    .option("url", url) \
    .option("dbtable", table_name) \
    .option("user", username) \
    .option("password", password) \
    .save()

That’s pySpark, not Scala Spark, but it looks almost identical. It seemed trivial to convert to Scala, but the extra driver option required in Scala makes all the difference.
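For the record, here’s a minimal Scala translation of that front-page snippet with the missing option added; the url, tableName, username, and password values are placeholders:

  df.write
    .format("com.microsoft.sqlserver.jdbc.spark")
    .mode("append")
    .option("url", url)
    .option("dbtable", tableName)
    .option("user", username)
    .option("password", password)
    // The extra option the pySpark example omits; without it, .save() errors in Scala:
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .save()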

Thanks for the help, I’m going to finish up my testing and then I’ll close this. I assume it’ll be fast now.

1 reaction
timgautier commented, Feb 7, 2022

Thanks for the tip, but I’ve tried all those things, plus many more, without seeing any improvement. We’ll be getting Spark running in Azure soon, I think, and we’ll see if that helps.


Top Results From Across the Web

Troubleshoot slow SQL Server performance caused by I/O ...
If SQL Server and the OS indicate that the I/O subsystem is slow, check if the cause is the system being overwhelmed beyond...

Why is writing from R to Azure SQL Server is very slow
I am trying to build out some data repositories in an Azure Managed Instance/Sql Server DB. I have been shocked by how slow...

Azure SQL DB is Slow: Do I Need to Buy More DTUs?
You've got an Azure SQL DB, and your queries are going slow. You're wondering, “Am I hitting the performance limits? Is Microsoft throttling...

Migrated DB to SQL Azure very slow - SQLServerCentral
It's hard to say without a lot more data. Capture query metrics using extended events. Capture wait statistics (you can combine these with...

insert on table being unnaturally slow - DBA Stack Exchange
These wait types are caused specifically due to the artificial limits put on the rate at which you can write to the transaction...
