Very slow to write to Azure SQL
Due to a recent company merger we run Spark on AWS and need to write to Azure SQL. We regularly write to MySQL and PostgreSQL and it’s very fast, but we’re finding that writing to SQL Server with the JDBC driver is unusably slow. I thought I’d try out this library due to the claimed speed, but I’m just not seeing any noticeable improvement.
My dataframe is about 28M rows, and I killed this after an hour when it had only written about 11M rows.
    val properties = new java.util.Properties()
    properties.put("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")

    df.write
      .mode(SaveMode.Overwrite)
      .option("truncate", true)
      .option("schemaCheckEnabled", false)
      .option("batchSize", "1048576")
      .option("tableLock", "false")
      .jdbc(dbConnectionMap("url"), "spark_test", properties)
I’ve tried numerous settings and have never seen an acceptable result. It’s possible this connector is faster than the default and I’m just not seeing the difference because I give up and kill both, but it doesn’t really matter if it’s still too slow to use.
For reference, the solution we came up with is to repartition the DataFrame to a single partition, write it to S3 as a single CSV file, call a Lambda function to transfer the CSV file from S3 to Azure Blob Storage, then do a bulk insert on that CSV file (a sketch of the Spark side follows the timings below). As convoluted as this solution is, it’s dramatically faster than writing to SQL Server directly.
Writing csv: Wed Jan 12 17:07:57 UTC 2022
Done writing csv: Wed Jan 12 17:09:06 UTC 2022
Calling lambda and waiting for transfer to azure: Wed Jan 12 17:09:08 UTC 2022
File exists on azure: Wed Jan 12 17:10:32 UTC 2022
Executing bulk insert Wed Jan 12 17:11:00 UTC 2022
Complete bulk insert after: 4:17
That’s about 8 minutes total for 28M rows versus an hour for less than half that.
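For completeness, here’s a minimal sketch of the Spark side of that workaround, with a hypothetical bucket and path; the Lambda transfer and the T-SQL BULK INSERT happen outside Spark:

    import org.apache.spark.sql.SaveMode

    // Collapse to a single partition so S3 ends up with one CSV file,
    // which the downstream BULK INSERT expects.
    df.repartition(1)
      .write
      .mode(SaveMode.Overwrite)
      .option("header", "true")                   // assumes the BULK INSERT skips the header row
      .csv("s3://my-bucket/exports/spark_test/")  // hypothetical bucket and path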
Is there a solution to this? I’m not seeing anywhere near the speeds you’re claiming in your docs. This doesn’t seem to be a limit imposed by the database server, since I can bulk insert the same data in about 4 minutes from a CSV file.
Top GitHub Comments
This is working. Issuing BULK INSERT statements now.

I see what happened. When calling .jdbc, you supply a java.util.Properties object, which typically contains the driver class name. If you call .save without giving the driver class name in the options, it errors. This is why I used .jdbc instead of .save: it’s what I use for other databases, and bulk inserts into them work fine, but it doesn’t behave the same with SQL Server.

The example on the front page is pySpark, not Scala Spark, but it looks almost identical. It seemed trivial to convert it to Scala, but the extra option required for Scala makes a big difference (see the sketch below).
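Roughly, the working Scala write looks like this minimal sketch, assuming the connector’s com.microsoft.sqlserver.jdbc.spark format name and placeholder credentials:

    import org.apache.spark.sql.SaveMode

    df.write
      .format("com.microsoft.sqlserver.jdbc.spark")
      .mode(SaveMode.Overwrite)
      .option("url", dbConnectionMap("url"))  // same JDBC URL as before
      .option("dbtable", "spark_test")
      // The extra option needed from Scala: .save() errors without the driver class name.
      .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
      .option("user", "myUser")               // placeholder credentials
      .option("password", "myPassword")
      .save()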
Thanks for the help, I’m going to finish up my testing and then I’ll close this. I assume it’ll be fast now.
Thanks for the tip, but I’ve tried all of those things plus many more and haven’t seen an improvement. We’ll be getting Spark running in Azure soon, I think, and we’ll see if that helps.