
JdbcSource to partition queries for potential performance improvements

  • Spark can partition data on a JDBC DataFrame by specifying the following binding parameters: lowerBound and upperBound (longs), numPartitions, and a partition key column (see the Spark sketch after this list).
  • To take advantage of this in JdbcSource, the data has to be divided into multiple partitions (processed in multiple threads). The binding parameters are used to rewrite the original query for each partition, e.g. by adding a predicate that restricts each partition to its range. For Oracle the partition key could be rownum, which is available on every table. For example, a population of 100 rows with lowerBound=1, upperBound=100, numPartitions=5 and query=select col1, col2 from table where blah would result in the following query on each partition:

Part 1:

select * from (select col1, col2 from table where blah)
where rownum between 1 and 20

Part 2:

select * from (select col1, col2 from table where blah)
where rownum between 21 and 40

Part 3:

select * from (select col1, col2 from table where blah)
where rownum between 41 and 60

Part 4:

select * from (select col1, col2 from table where blah)
where rownum between 61 and 80

Part 5:

select * from (select col1, col2 from table where blah)
where rownum >= 81
  • For partition 5, just return the remainder of the rows.

  • Note that it may not be necessary to create N connections, one per partition; a single statement can simply return N JDBC result sets, one for each partition. Investigating this…
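
For reference, this is roughly how Spark exposes those binding parameters on its JDBC reader (a sketch only; the URL, table and column names are placeholders, not from this issue):

import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-partition-demo").getOrCreate()

val props = new Properties()
props.setProperty("user", "username")
props.setProperty("password", "password")

// Spark issues numPartitions queries, each restricted to a slice of
// [lowerBound, upperBound] on the partition column.
val df = spark.read.jdbc(
  "jdbc:oracle:thin:@//myhost:1901/myservice", // url (placeholder)
  "MY_TABLE",                                  // table
  "MY_PRIMARY_KEY",                            // partition column
  1L,                                          // lowerBound
  100L,                                        // upperBound
  5,                                           // numPartitions
  props
)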

Proposal

  • withPartition(lowerBound, upperBound, partitionColumn)
.withPartition(1, 100, "rownum")
  • rownum is a partition key internal to Oracle, so you can’t use it across the board with a call like withPartition(1, 100). SQL Server, for example, doesn’t have rownum, but the same effect can be achieved with the windowing function ROW_NUMBER(), deriving a row-number column in a subquery and filtering on it:

select * from (
  select t.*, ROW_NUMBER() OVER (ORDER BY CustomerID ASC) as RowNumber
  from (select col1, col2 from table where blah) t
) numbered
where RowNumber between 1 and 20

  • Here CustomerID is the primary key and ROW_NUMBER() is the SQL Server windowing function.
  • One idea is to provide an upper bound function, like so (see the sketch after this list):
withPartition(lowerBound: Long, upperBoundFn: String => Long, partitionColumn: String)
  • The upperBoundFn could do anything you like, such as execute a separate count query or just supply the count directly.
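
A minimal sketch of how the proposed builder might hang together, reusing the rownum example above (hypothetical, not eel’s actual API; PartitionSpec and partitionQueries are names invented here for illustration):

case class PartitionSpec(lowerBound: Long, upperBound: Long, partitionColumn: String)

case class JdbcSource(query: String, partition: Option[PartitionSpec] = None) {

  // Fixed bounds, as in .withPartition(1, 100, "rownum")
  def withPartition(lowerBound: Long, upperBound: Long, partitionColumn: String): JdbcSource =
    copy(partition = Some(PartitionSpec(lowerBound, upperBound, partitionColumn)))

  // Upper bound computed from the query, e.g. by running a separate count(*) first
  def withPartition(lowerBound: Long, upperBoundFn: String => Long, partitionColumn: String): JdbcSource =
    withPartition(lowerBound, upperBoundFn(query), partitionColumn)

  // One range predicate per partition; the last partition takes the remainder.
  def partitionQueries(numPartitions: Int): Seq[String] = partition match {
    case Some(PartitionSpec(lo, hi, col)) =>
      val step = (hi - lo + 1) / numPartitions
      (0 until numPartitions).map { i =>
        val from = lo + i * step
        if (i == numPartitions - 1) s"select * from ( $query ) where $col >= $from"
        else s"select * from ( $query ) where $col between $from and ${from + step - 1}"
      }
    case None => Seq(query)
  }
}

With lowerBound=1, upperBound=100 and numPartitions=5 this reproduces the five rownum queries listed above.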

Example of returning N JDBC result sets

import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;

public static void executeProcedure(Connection con) {
   try {
      // The actual call string is elided in this example; substitute your procedure.
      CallableStatement stmt = con.prepareCall("{call my_procedure()}"); // hypothetical
      // Set call parameters here if you have IN, OUT, or IN/OUT parameters.

      boolean results = stmt.execute();
      int rsCount = 0;

      // Loop through the available result sets.
      while (results) {
         ResultSet rs = stmt.getResultSet();
         rsCount++;

         // Retrieve data from the current result set.
         while (rs.next()) {
            // Use the rs.getXxx() methods to retrieve data.
         }
         rs.close();

         // Check for the next result set.
         results = stmt.getMoreResults();
      }
      System.out.println("Processed " + rsCount + " result set(s).");
      stmt.close();
   }
   catch (SQLException e) {
      e.printStackTrace();
   }
}


Top GitHub Comments

hannesmiller commented, Feb 15, 2017

More analysis

  • I have been doing some JDBC performance tests on an Oracle table of 50 million rows.
  • If I run with a single thread, it takes an average of 175 seconds over 5 runs.
  • If I break the query up into multiple queries, using a hash to partition each one, the time drops dramatically to 37 seconds.

Multi-query tests performed on an 8-core PC

  • The idea was to assign each core to a query running in a separate thread.
  • I use the same fetch size as the original single query.
  • Each query thread acquires a connection from a JDBC connection pool.

Code

package hannesmiller

import java.io.{File, PrintWriter}
import java.util.concurrent.{Callable, Executors}

import com.sksamuel.exts.Logging
import org.apache.commons.dbcp2.BasicDataSource

import scala.collection.mutable.ListBuffer

object MultiHashDcfQuery extends App with Logging {

  private def generateStatsFile(fileName: String, stats: ListBuffer[String]): Unit = {
    val statsFile = new File(fileName)
    println(s"Generating ${statsFile.getAbsolutePath} ...")
    val statsFileWriter = new PrintWriter(statsFile)
    stats.foreach { s => statsFileWriter.write(s + "\n"); statsFileWriter.flush() }
    statsFileWriter.close()
    println(s"${statsFile.getAbsolutePath} done!")
  }

  val recordCount = 49510353L
  val partitionsStartNumber = 2
  val numberOfPartitions = 8
  val numberOfRuns = 1

  val sql =
    s"""SELECT MY_PRIMARY_KEY, COL2, COL3, COL4, COL5
       FROM MY_TABLE
       WHERE COL2 in (8682)"""

  def buildPartitionSql(bindExpression: String, bindExpressionAlias: String): String = {
    s"""
       |SELECT *
       |FROM (
       |  SELECT eel_tmp.*, $bindExpression AS $bindExpressionAlias
       |  FROM ( $sql ) eel_tmp
       |)
       |WHERE $bindExpressionAlias = ?
       |""".stripMargin
  }

  // Setup the database connection pool equal to the number of partitions - could be less depending on your connection
  // resource limit on the Database server.
  val dataSource = new BasicDataSource()
  dataSource.setDriverClassName("oracle.jdbc.OracleDriver")
  dataSource.setUrl("jdbc:oracle:thin:@//myhost:1901/myservice")
  dataSource.setUsername("username")
  dataSource.setPassword("username1234")
  dataSource.setPoolPreparedStatements(false)
  dataSource.setInitialSize(numberOfPartitions)
  dataSource.setDefaultAutoCommit(false)
  dataSource.setMaxOpenPreparedStatements(numberOfPartitions)

  val stats = ListBuffer[String]()
  for (numPartitions <- partitionsStartNumber to numberOfPartitions) {
    for (runNumber <- 1 to numberOfRuns) {

      // Kick off a number of threads equal to the number of partitions so each partitioned query is executed in parallel.
      val threadPool = Executors.newFixedThreadPool(numberOfPartitions)
      val startTime = System.currentTimeMillis()
      val fetchSize = 100600
      val futures = for (i <- 1 to numberOfPartitions) yield {
        threadPool.submit(new Callable[(Long, Long, Long, Long)] {
          override def call(): (Long, Long, Long, Long) = {
            var rowCount = 0L

            // Capture metrics about acquiring connection
            val connectionIdleTimeStart = System.currentTimeMillis()
            val connection = dataSource.getConnection
            val connectionIdleTime = System.currentTimeMillis() - connectionIdleTimeStart

            val partSql = buildPartitionSql(s"MOD(ORA_HASH(MY_PRIMARY_KEY),$numberOfPartitions) + 1", "PARTITION_NUMBER")
            val prepareStatement = connection.prepareStatement(partSql)
            prepareStatement.setFetchSize(fetchSize)
            prepareStatement.setLong(1, i)


            // Capture metrics for query execution
            val executeQueryTimeStart = System.currentTimeMillis()
            val rs = prepareStatement.executeQuery()
            val executeQueryTime = (System.currentTimeMillis() - executeQueryTimeStart) / 1000

            // Capture metrics for fetching data
            val fetchTimeStart = System.currentTimeMillis()
            while (rs.next()) {
              rowCount += 1
              if (rowCount % fetchSize == 0) logger.info(s"RowCount = $rowCount")
            }
            val fetchTime = (System.currentTimeMillis() - fetchTimeStart) / 1000

            rs.close()
            prepareStatement.close()
            connection.close()
            (connectionIdleTime, executeQueryTime, fetchTime, rowCount)
          }
        })
      }

      // Total up all the rows
      var totalRowCount = 0L
      var totalConnectionIdleTime = 0L
      futures.foreach { f =>
        val (connectionIdleTime, executeQueryTime, fetchTime, rowCount) = f.get
        logger.info(s"connectionIdleTime=$connectionIdleTime, executeQueryTime=$executeQueryTime, fetchTime=$fetchTime, rowCount=$rowCount")
        totalConnectionIdleTime += connectionIdleTime
        totalRowCount += rowCount
      }
      val elapsedTime = (System.currentTimeMillis() - startTime) / 1000.0
      logger.info(s"Run $runNumber with $numPartitions partition(s): Took $elapsedTime second(s) for RowCount = $totalRowCount, totalConnectionIdlTime = $totalConnectionIdleTime")
      threadPool.shutdownNow()
      stats += s"$numPartitions\t$runNumber\t$elapsedTime"
    }
  }
  generateStatsFile("multi_partition_stats.csv", stats)

}
  • For each thread I create the partitioned SQL using Oracle MOD/HASH functions on the primary key column:
  val partSql = buildPartitionSql(s"MOD(ORA_HASH(MY_PRIMARY_KEY),$numberOfPartitions) + 1", "PARTITION_NUMBER")
  ...
  ...
  def buildPartitionSql(bindExpression: String, bindExpressionAlias: String): String = {
    s"""
       |SELECT *
       |FROM (
       |  SELECT eel_tmp.*, $bindExpression AS $bindExpressionAlias
       |  FROM ( $sql ) eel_tmp
       |)
       |WHERE $bindExpressionAlias = ?
       |""".stripMargin
  }
  • The SQL returned augments the original query with the bindExpression argument and aliases it as the column PARTITION_NUMBER.
  • The subsequent lines create a JDBC prepared statement and bind the desired partition number:
val prepareStatement = connection.prepareStatement(partSql)
prepareStatement.setFetchSize(fetchSize)
prepareStatement.setLong(1, i)

Solution

  • Can we implement this in EEL on the JdbcSource?
  • It’s very difficult to generalize, as this mechanism may not behave the same way on another DBMS like SQL Server (though they do have hash and mod functions).
  • That’s why I am proposing to pass in an expression and an alias for the derived column; a sketch of a SQL Server variant follows below.
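
As a rough sketch, the pluggable expression for SQL Server might reuse buildPartitionSql from the test code above with a CHECKSUM-based bucket (an assumption for illustration, not something tested in this issue):

// Hypothetical SQL Server analogue of the Oracle MOD/ORA_HASH bind expression;
// ABS(CHECKSUM(...)) is a common way to hash a key to an int on SQL Server.
val partSqlSqlServer = buildPartitionSql(
  s"ABS(CHECKSUM(MY_PRIMARY_KEY)) % $numberOfPartitions + 1",
  "PARTITION_NUMBER"
)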
hannesmiller commented, Feb 6, 2017

There’s a flaw in my proposal using rownum: each rownum predicate performs a full table scan. Spark does this differently, in a more efficient manner.

You can, for example, supply a hash function on a column (such as the primary key) to return a long; most databases come with some kind of hash function.
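
To illustrate, the per-dialect piece would then just be the hash-bucket expression; the candidate expressions below are assumptions to verify against each DBMS, not part of the issue:

// Illustrative hash-bucket expressions per dialect (verify against each DBMS).
def hashBucketExpr(dialect: String, col: String, n: Int): String = dialect match {
  case "oracle"    => s"MOD(ORA_HASH($col), $n) + 1"
  case "sqlserver" => s"ABS(CHECKSUM($col)) % $n + 1"
  case other       => sys.error(s"no hash expression known for $other")
}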
