
Very slow to write to Azure SQL

See original GitHub issue

Due to a recent company merger we run Spark on AWS and need to write to Azure SQL. We regularly write to MySQL and Postgresql and it’s very fast, but we’re finding that writing to SQL Server with the jdbc driver is unusably slow. I thought I’d try out this library due to the claimed speed, but I’m just not seeing any noticeable improvement.

My dataframe is about 28M rows, and I killed this after an hour when it had only written about 11M rows.

  val properties = new java.util.Properties()
  properties.put("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")

  df.write.
    mode(SaveMode.Overwrite).
    option("truncate", true).
    option("schemaCheckEnabled", false).
    option("batchSize", "1048576").
    option("tableLock", "false").
    jdbc(dbConnectionMap("url"), "spark_test", properties)

I’ve tried with numerous settings and have never seen an acceptable result. It’s possible this connector is faster than the default and I’m just not seeing the difference because I give up and kill them both, but it doesn’t really matter if it’s still too slow to use.

For reference, the solution we came up with is to repartition the dataframe to a single partition, write it to S3 as a single CSV file, call a lambda function to transfer the CSV file from S3 to Azure Blob Storage, then do a bulk insert on that CSV file. As convoluted as this solution is, it’s dramatically faster than writing to SQL Server directly.

  Writing csv: Wed Jan 12 17:07:57 UTC 2022
  Done writing csv: Wed Jan 12 17:09:06 UTC 2022
  Calling lambda and waiting for transfer to azure: Wed Jan 12 17:09:08 UTC 2022
  File exists on azure: Wed Jan 12 17:10:32 UTC 2022
  Executing bulk insert: Wed Jan 12 17:11:00 UTC 2022
  Complete bulk insert after: 4:17

That’s about 8 minutes total for 28M rows, versus over an hour to write less than half that many rows directly.
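For completeness, the workaround above can be sketched roughly as follows. The bucket path, the lambda invocation, and the external data source name (`MyBlobStorage`) are all placeholders, not the actual values from our setup:

  // Write the dataframe as a single CSV object to S3 (placeholder path).
  df.repartition(1)
    .write
    .mode(SaveMode.Overwrite)
    .option("header", "false")
    .csv("s3://my-bucket/exports/spark_test/")

  // A lambda (not shown) then copies the object to Azure Blob Storage,
  // after which a bulk insert is issued against the target table, e.g.:
  //   BULK INSERT spark_test
  //   FROM 'exports/spark_test/part-00000.csv'
  //   WITH (DATA_SOURCE = 'MyBlobStorage',
  //         FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');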

Is there a solution to this? I’m not seeing anywhere near the speeds you’re claiming in your docs. This doesn’t seem to be a limit imposed by the database server, since I can bulk insert the same data in 4 minutes from a CSV file.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 15 (5 by maintainers)

Top GitHub Comments

2 reactions
timgautier commented, Mar 24, 2022
  df.write
    .format("com.microsoft.sqlserver.jdbc.spark")
    .mode("overwrite")
    .option("url", url)
    .option("dbtable", "spark_test")
    .option("user", dbUser)
    .option("password", dbPassword)
    .option("database", dbName)
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .save()

This is working. Issuing BULK INSERT statements now.

I see what happened. When calling .jdbc, you supply a java.util.Properties object, which typically contains the driver class name. If you call .save without also giving the driver class name in the options, it errors out. This is why I used .jdbc instead of .save: it’s what I use for other databases, and bulk inserts into them work fine, but it doesn’t behave the same with SQL Server.

On the front page is this example:

df.write \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .mode("append") \
    .option("url", url) \
    .option("dbtable", table_name) \
    .option("user", username) \
    .option("password", password) \
    .save()

That’s pySpark, not Scala Spark, but it looks almost identical. It seemed trivial to convert to Scala, but the extra driver option required in Scala makes all the difference.
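For the record, here’s a minimal Scala translation of that front-page snippet with the missing option added; the url, tableName, username, and password values are placeholders:

  df.write
    .format("com.microsoft.sqlserver.jdbc.spark")
    .mode("append")
    .option("url", url)
    .option("dbtable", tableName)
    .option("user", username)
    .option("password", password)
    // The extra option the pySpark example omits; without it, .save() errors in Scala:
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .save()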

Thanks for the help, I’m going to finish up my testing and then I’ll close this. I assume it’ll be fast now.

1 reaction
timgautier commented, Feb 7, 2022

Thanks for the tip, but I’ve tried all those things, plus many more, without seeing any improvement. We’ll be getting Spark running in Azure soon, I think, and we’ll see if that helps.


Top Results From Across the Web

Troubleshoot slow SQL Server performance caused by I/O ...
If SQL Server and the OS indicate that the I/O subsystem is slow, check if the cause is the system being overwhelmed beyond...

Why is writing from R to Azure SQL Server is very slow
I am trying to build out some data repositories in an Azure Managed Instance/Sql Server DB. I have been shocked by how slow...

Azure SQL DB is Slow: Do I Need to Buy More DTUs?
You've got an Azure SQL DB, and your queries are going slow. You're wondering, “Am I hitting the performance limits? Is Microsoft throttling...

Migrated DB to SQL Azure very slow - SQLServerCentral
It's hard to say without a lot more data. Capture query metrics using extended events. Capture wait statistics (you can combine these with...

insert on table being unnaturally slow - DBA Stack Exchange
These wait types are caused specifically due to the artificial limits put on the rate at which you can write to the transaction...
