JdbcSource to partition queries for potential performance improvements
- Spark can partition data in a JDBC DataFrame by specifying the following binding parameters: lowerBound, upperBound and numPartitions (all longs), plus a partition key column
- To take advantage of this in JdbcSource, the data has to be divided into multiple partitions, each read on its own thread. The binding parameters are used to rewrite the original query for each partition, i.e. a predicate is added that restricts the query to that partition's rows. For Oracle the partition key could be rownum, which is available on every table. For example, a population of 100 rows with lowerBound=1, upperBound=100, numPartitions=5 and query=select col1, col2 from table where blah would result in the following query per partition (see the range-computation sketch after this list):
Part 1:
select * from (select col1, col2 from table where blah)
where rownum between 1 and 20
Part 2:
select * from (select col1, col2 from table where blah)
where rownum between 21 and 40
Part 3:
select * from (select col1, col2 from table where blah)
where rownum between 41 and 60
Part 4:
select * from (select col1, col2 from table where blah)
where rownum between 61 and 80
Part 5:
select * from (select col1, col2 from table where blah)
where rownum >= 81
- For partition 5, the last partition, just return the remainder of the rows.
- Note it may not be necessary to create N connections, one per partition - a single call can simply return N JDBC result sets, one for each partition - investigating this…
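A minimal sketch of the range computation and query rewriting described above - the helper names (partitionRanges, wrapQuery) are illustrative, not existing JdbcSource API. One caveat with the bare queries shown above: in Oracle, a predicate like rownum between 21 and 40 applied directly to an inline view returns no rows, because rownum is assigned only as rows qualify, so the sketch materialises rownum into a column first and filters on that:

case class PartRange(lower: Long, upper: Option[Long]) // upper = None => open-ended last partition

// Split [lowerBound, upperBound] into numPartitions contiguous ranges;
// the last range is open-ended so the remainder rows are not lost.
def partitionRanges(lowerBound: Long, upperBound: Long, numPartitions: Int): Seq[PartRange] = {
  val size = (upperBound - lowerBound + 1) / numPartitions
  (0 until numPartitions).map { k =>
    val lo = lowerBound + k * size
    if (k == numPartitions - 1) PartRange(lo, None) else PartRange(lo, Some(lo + size - 1))
  }
}

// Materialise rownum as a column (rn) so a BETWEEN on the higher ranges
// works in Oracle, then filter on the partition's range.
def wrapQuery(query: String, r: PartRange): String = r.upper match {
  case Some(hi) => s"select * from (select q.*, rownum rn from ($query) q) where rn between ${r.lower} and $hi"
  case None     => s"select * from (select q.*, rownum rn from ($query) q) where rn >= ${r.lower}"
}

// partitionRanges(1, 100, 5).map(wrapQuery("select col1, col2 from table where blah", _))
// yields five queries equivalent to the ones shown above, one per partition/thread.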
Proposal
- withPartition(lowerBound, upperBound, partitionColumn)
.withPartition(1, 100, "rownum")
- rownum is a partition key internal to Oracle, so it can't be used across the board with a call like withPartition(1, 100); SQL Server, for example, has no rownum, but the same thing can be achieved with the windowing function ROW_NUMBER(). Since a window function can't appear directly in a WHERE clause, the row number has to be computed in a derived table first:
select * from (
select CustomerID, col1, col2, ROW_NUMBER() OVER (ORDER BY CustomerID ASC) as RowNumber
from table where blah
) numbered
where RowNumber between 21 and 40
- where CustomerID is the primary key and ROW_NUMBER() is the SQL Server windowing function.
- One idea is to provide an upper-bound function, like so (see the sketch below):
withPartition(lowerBound: Long, upperBoundFn: String => Long, partitionColumn: String)
- the upperBoundFn takes the query and could do anything you like, such as execute a separate count query or just supply a constant count.
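A minimal sketch of what the proposed overload could look like - all names here are hypothetical, not existing JdbcSource API - with a count-based upperBoundFn that derives the bound from the query itself:

import java.sql.DriverManager

// Hypothetical builder sketch; copy-on-write configuration in the usual Scala style.
case class JdbcSourceSketch(url: String,
                            query: String,
                            lowerBound: Long = 1L,
                            upperBoundFn: String => Long = _ => Long.MaxValue,
                            partitionColumn: String = "rownum") {
  def withPartition(lower: Long, upperFn: String => Long, column: String): JdbcSourceSketch =
    copy(lowerBound = lower, upperBoundFn = upperFn, partitionColumn = column)
}

// An upperBoundFn that executes a separate count query over the user's query.
// (Oracle accepts an unaliased inline view; some databases require an alias.)
def countOf(url: String): String => Long = { query =>
  val conn = DriverManager.getConnection(url)
  try {
    val rs = conn.createStatement().executeQuery(s"select count(*) from ($query)")
    rs.next()
    rs.getLong(1)
  } finally conn.close()
}

// source.withPartition(1, countOf(jdbcUrl), "rownum")   // bound computed from a count
// source.withPartition(1, _ => 100L, "rownum")          // or just supply the count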
Example of returning N JDBC result sets
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.ResultSet;

public static void executeProcedure(Connection con) {
    try {
        CallableStatement stmt = con.prepareCall(...); // call string elided in the original
        ..... // set call parameters here, if you have IN, OUT or IN/OUT parameters
        boolean results = stmt.execute();
        int rsCount = 0;
        // Loop through the available result sets.
        while (results) {
            ResultSet rs = stmt.getResultSet();
            rsCount++;
            // Retrieve data from the current result set.
            while (rs.next()) {
                .... // use the rs.getXxx() methods to retrieve data
            }
            rs.close();
            // Check for the next result set.
            results = stmt.getMoreResults();
        }
        stmt.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
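One caveat with the multi-result-set route, worth noting as a design consideration: getMoreResults() hands the result sets back sequentially on a single statement and connection, so the partitions would likely be consumed one after another rather than in parallel; per-partition queries on separate connections, as above, keep the reads concurrent.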
Top GitHub Comments
More analysis
Multi-query tests were performed on an 8-core PC.
Solution
There's a flaw in my proposal using rownum - each rownum predicate performs a full table scan, whereas Spark does this differently, in a more efficient manner.
You can instead supply a hash function on a column (the primary key, say) that returns a long - most databases come with some kind of hash function; see the sketch below.
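A minimal sketch of the hash-based alternative, assuming Oracle's built-in ORA_HASH(expr, max_bucket) - other databases have their own equivalents - and a hypothetical generator function that is not JdbcSource API. Each partition reads only the rows whose key hashes to its bucket, so no rownum ordering or nested inline views are needed:

// Hypothetical sketch: hash-partition a query on a key column instead of rownum.
// ORA_HASH(expr, max_bucket) is Oracle-specific; it maps each value into a
// bucket in [0, max_bucket], so max_bucket = numPartitions - 1 gives exactly
// one bucket per partition.
def hashPartitionQueries(query: String, keyColumn: String, numPartitions: Int): Seq[String] =
  (0 until numPartitions).map { bucket =>
    s"select * from ($query) where ORA_HASH($keyColumn, ${numPartitions - 1}) = $bucket"
  }

// e.g. hashPartitionQueries("select col1, col2 from table where blah", "col1", 5)
// yields five queries, one per bucket 0..4, which can be executed on
// separate connections in parallel.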