Improved SQL functionality
Linked to PR #879
Description
Kedro’s SQL functionality is still missing some key features, some of which this PR seeks to add. Specifically, two main features are added:
- The existing pandas.SQLQueryDataSet is modified to allow for a long SQL query to be stored in a file and referenced through the filepath argument
- A new dataset, sql.SQLConnectionDataSet, is added to give the user access to a sqlalchemy Connection object
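To make the first feature concrete, here is a minimal sketch of a dataset that accepts either an inline query or a path to a query file. It is not the PR's implementation: the class name QuerySketch is hypothetical, and the standard-library sqlite3 module stands in for pandas and sqlalchemy so that the example runs on its own. The only idea carried over from the description is that sql and filepath are two interchangeable ways of supplying the same query.

```python
import sqlite3
import tempfile
from pathlib import Path
from typing import List, Optional, Tuple


class QuerySketch:
    """Hypothetical stand-in for a query dataset; not Kedro's actual class."""

    def __init__(
        self,
        con: sqlite3.Connection,
        sql: Optional[str] = None,
        filepath: Optional[str] = None,
    ) -> None:
        # Assumption: exactly one of `sql` or `filepath` may be supplied.
        if (sql is None) == (filepath is None):
            raise ValueError("Provide exactly one of 'sql' or 'filepath'.")
        self._con = con
        self._sql = sql
        self._filepath = filepath

    def load(self) -> List[Tuple]:
        # Read the query text from disk at load time if a file was given.
        if self._sql is not None:
            query = self._sql
        else:
            query = Path(self._filepath).read_text(encoding="utf-8")
        return self._con.execute(query).fetchall()


# Small in-memory database to query against.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE table_a (col1 INTEGER)")
con.executemany("INSERT INTO table_a VALUES (?)", [(1,), (2,), (3,)])

# (1) Short query written inline.
inline_rows = QuerySketch(con, sql="SELECT col1 FROM table_a ORDER BY col1").load()

# (2) Same query moved to a .sql file; only the argument name changes.
with tempfile.TemporaryDirectory() as tmp:
    query_file = Path(tmp) / "query.sql"
    query_file.write_text("SELECT col1 FROM table_a ORDER BY col1", encoding="utf-8")
    file_rows = QuerySketch(con, filepath=str(query_file)).load()
```

Rejecting the case where both arguments are given is a design choice of this sketch; it avoids any ambiguity about which query would run.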
Context
Being able to run complex queries on SQL databases is essential for many data science projects. However, doing this in a kedronic way, where all the I/O logic is offloaded to the catalog, is difficult when the queries are complex or it is preferable to use something other than pandas (extremely large datasets shouldn’t be loaded into memory, for instance).
Issue Analytics
- State:
- Created 2 years ago
- Comments:8 (7 by maintainers)
@AntonyMilneQB I had the exact same discussion about whether to extend SQLQueryDataSet or write a new class entirely with @datajoely. I started with the new class approach and quickly realized that 90% of the code was identical. I also think that it makes more sense to have the same class from a user perspective. Imagine how someone might use this dataset: (1) start off with a simple query (SELECT col1 from table_a) and then (2) the query gets longer and longer until finally it’s stored in a separate file. This would be easier if they simply need to change the sql argument to a filepath argument, rather than having to look up and find the name of a new dataset.
As for the use of the SQLConnectionDataSet, this is something I’m still working on an example use case for (see the docstring) so I hope to have a more concrete answer (@datajoely if you have a more specific idea…) but in a nutshell, the idea is that it might be good to allow for a user to perform all data manipulations for a given node entirely on a SQL database and never load anything into memory. This is of course possible if they were to create the sqlalchemy object themselves, but creating this dataset allows for the use of Kedro credentials and less hardcoding, as one possible advantage.
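The connection idea described above can be sketched as follows. This is only an illustration under stated assumptions: ConnectionSketch and summarise_in_database are hypothetical names, the credentials dictionary with a "con" key is an assumed shape, and sqlite3 from the standard library replaces the sqlalchemy Connection so the example is self-contained. The point is that the dataset hands the node a live connection, and the node does its work with SQL statements that run inside the database.

```python
import sqlite3
from typing import Dict


class ConnectionSketch:
    """Hypothetical dataset whose load() returns a database connection."""

    def __init__(self, credentials: Dict[str, str]) -> None:
        # Assumption: the connection target arrives via a credentials
        # mapping, so nothing is hardcoded in the node itself.
        self._credentials = credentials

    def load(self) -> sqlite3.Connection:
        return sqlite3.connect(self._credentials["con"])


def summarise_in_database(con: sqlite3.Connection) -> None:
    """Node body: the aggregation runs in the database, not in Python."""
    con.execute(
        "CREATE TABLE totals AS "
        "SELECT category, SUM(amount) AS total FROM sales GROUP BY category"
    )
    con.commit()


# Set up a small in-memory database through the sketched dataset.
con = ConnectionSketch({"con": ":memory:"}).load()
con.execute("CREATE TABLE sales (category TEXT, amount INTEGER)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?)", [("a", 1), ("a", 2), ("b", 5)]
)

summarise_in_database(con)

# Fetched here only to inspect the result; the node itself loaded no rows.
totals = con.execute(
    "SELECT category, total FROM totals ORDER BY category"
).fetchall()
```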
@Galileo-Galilei thank you for your comments. I actually agree that the solution in point 1 (the SQLQueryDataSet extension) does not achieve the goal of avoiding loading everything into memory. That goal is entirely for point 2 (the SQLConnectionDataSet). What the SQLQueryDataSet change achieves is simply allowing complex SQL queries to be written in an external .sql file outside of the catalog (taking advantage of syntax highlighting and other IDE tools, which seems to be a small thing but honestly can be quite impactful for the developer experience; as well as making the catalog less cluttered). This dataset is indeed intended to load the result into memory 😄. So in summary, this PR has two datasets, which solve two distinct but related issues having to do with SQL functionality in Kedro.
Not at all. What I mean is that patching the SQLQueryDataSet is not a sustainable long-term solution. In my opinion, the right solution is to remove from the catalog all the datasets which perform computation on a different backend (the XXXQueryDataSet). Obviously, this assumes we have a better way to perform such operations, and that is not the case now. This PR solves a symptom, which is very useful for now, but not the real issue (in French I would say that we "put a bandage on a wooden leg"; not sure how to translate it, but it is quite explicit 😄).
I think it is a very good move towards the right solution, and what I will suggest will be very similar. My main concern here is that I think the DataCatalog is not the right place to declare such a connection (I mean sql.SQLConnectionDataSet should not be a DataSet but another object more suited to how we want to use it). However, I acknowledge that it is currently the best place because you can leverage Kedro’s credentials mechanism, so it is certainly a first step towards a “happier kedro-SQL world”.
To summarize, if those 2 datasets were to be released tomorrow, I would probably make extensive use of them 😃. Since refactoring the catalog is something that will likely take months (years?), it makes sense to provide them as a short-term solution to some problems users are facing with Kedro/SQL interaction right now.