Improved SQL functionality

Linked to PR #879

Description

Kedro’s SQL functionality is still missing some key features, some of which this PR seeks to add. Specifically, two main features are added:

  1. The existing pandas.SQLQueryDataSet is modified so that a long SQL query can be stored in a file and referenced through the filepath argument.
  2. A new dataset, sql.SQLConnectionDataSet, is added to give the user access to a sqlalchemy Connection object.
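
The first feature can be illustrated with a minimal sketch. This is not Kedro's implementation: the helper names `resolve_query` and `run_query` are hypothetical, and the standard-library sqlite3 module stands in for a real database backend. It only shows the idea of accepting either an inline `sql` string or a `filepath` pointing to a .sql file, so that switching between the two is a one-argument change.

```python
import sqlite3
import tempfile
from pathlib import Path
from typing import Optional


def resolve_query(sql: Optional[str] = None, filepath: Optional[str] = None) -> str:
    """Return the query text from exactly one of `sql` or `filepath` (hypothetical helper)."""
    if (sql is None) == (filepath is None):
        raise ValueError("Provide exactly one of 'sql' or 'filepath'.")
    if filepath is not None:
        # Long queries live in their own .sql file, outside the catalog.
        return Path(filepath).read_text(encoding="utf-8")
    return sql


def run_query(conn: sqlite3.Connection, sql: Optional[str] = None,
              filepath: Optional[str] = None) -> list:
    """Execute the resolved query and load the full result into memory."""
    return conn.execute(resolve_query(sql=sql, filepath=filepath)).fetchall()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table_a (col1 INTEGER, col2 TEXT)")
    conn.executemany("INSERT INTO table_a VALUES (?, ?)", [(1, "a"), (2, "b")])

    # Step 1: a short query passed inline.
    inline_rows = run_query(conn, sql="SELECT col1 FROM table_a ORDER BY col1")

    # Step 2: the same query, now stored in a separate file.
    with tempfile.TemporaryDirectory() as tmp:
        query_file = Path(tmp) / "long_query.sql"
        query_file.write_text("SELECT col1\nFROM table_a\nORDER BY col1\n", encoding="utf-8")
        file_rows = run_query(conn, filepath=str(query_file))

    conn.close()
    return inline_rows, file_rows
```

Rejecting calls that pass both arguments (or neither) is one possible design choice; it keeps the source of the query unambiguous.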

Context

Being able to run complex queries on SQL databases is essential for many data science projects. However, doing this in a kedronic way, where all the I/O logic is offloaded to the catalog, is difficult when the queries are complex or it is preferable to use something other than pandas (extremely large datasets shouldn’t be loaded into memory, for instance).

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 8 (7 by maintainers)

Top GitHub Comments

3 reactions
BenjaminLevyQB commented, Sep 9, 2021

@AntonyMilneQB I had the exact same discussion about whether to extend SQLQueryDataSet or write a new class entirely with @datajoely. I started with the new class approach and quickly realized that 90% of the code was identical. I also think that it makes more sense to have the same class from a user perspective. Imagine how someone might use this dataset: (1) start off with a simple query (SELECT col1 from table_a) and then (2) the query gets longer and longer until finally it’s stored in a separate file. This would be easier if they simply need to change the sql argument to a filepath argument, rather than having to look up and find the name of a new dataset.

As for the SQLConnectionDataSet, I'm still working on an example use case (see the docstring), so I hope to have a more concrete answer soon (@datajoely if you have a more specific idea…). In a nutshell, the idea is to allow a user to perform all data manipulations for a given node entirely on a SQL database and never load anything into memory. This is of course possible if they create the sqlalchemy object themselves, but this dataset allows for the use of Kedro credentials and less hardcoding, as one possible advantage.
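
A rough sketch of that idea follows. It rests on assumptions: a standard-library sqlite3 connection stands in for the sqlalchemy Connection object the dataset would provide, and the function and table names (`build_summary_in_database`, `orders`, `order_totals`) are hypothetical. The point is only that the node receives a connection and the aggregation runs inside the database, so no rows are pulled into Python memory by the node itself.

```python
import sqlite3


def build_summary_in_database(conn: sqlite3.Connection) -> None:
    """Hypothetical node body: aggregate inside the database, fetch nothing."""
    conn.execute("DROP TABLE IF EXISTS order_totals")
    conn.execute(
        """
        CREATE TABLE order_totals AS
        SELECT customer_id, SUM(amount) AS total
        FROM orders
        GROUP BY customer_id
        """
    )
    conn.commit()


def demo():
    # Stand-in for the connection a catalog entry would hand to the node.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(1, 10.0), (1, 5.0), (2, 7.5)],
    )

    build_summary_in_database(conn)

    # Only the demo reads the result back, to check what the node produced.
    totals = conn.execute(
        "SELECT customer_id, total FROM order_totals ORDER BY customer_id"
    ).fetchall()
    conn.close()
    return totals
```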

@Galileo-Galilei thank you for your comments. I actually agree that the solution in point 1 (the SQLQueryDataSet extension) does not achieve the goal of avoiding loading everything into memory. That goal is entirely for point 2 (the SQLConnectionDataSet). What the SQLQueryDataSet change achieves is simply allowing complex SQL queries to be written in an external .sql file outside of the catalog. This takes advantage of syntax highlighting and other IDE tools, which seems like a small thing but honestly can be quite impactful for the developer experience, and it makes the catalog less cluttered. This dataset is indeed intended to load the result into memory 😄. So in summary, this PR has two datasets, which solve two distinct but related issues having to do with SQL functionality in Kedro.

2 reactions
Galileo-Galilei commented, Sep 11, 2021

Do you think there is an actual problem with point 1 (extending SQLQueryDataSet to accept filepath)?

Not at all. What I mean is that patching the SQLQueryDataSet is not a sustainable long-term solution. In my opinion, the right solution is to remove from the catalog all the datasets that perform computation on a different backend (the XXXQueryDataSet). Obviously, this assumes we have a better way to perform such operations, which is not the case now. This PR solves a symptom, which is very useful for now, but not the real issue (in French I would say that we "put a bandage on a wooden leg"; not sure how to translate it, but it is quite explicit 😄).

Do you think adding the [sql.SQLConnectionDataSet] is actually a bad move as an incremental step towards a happier kedro-SQL world or again just that it’s not the full solution?

I think it is a very good move towards the right solution, and what I will suggest will be very similar. My main concern here is that I think the DataCatalog is not the right place to declare such a connection (I mean sql.SQLConnectionDataSet should not be a DataSet but another object more suited to how we want to use it). However, I acknowledge that it is currently the best place because you can leverage Kedro’s credentials mechanism so it is certainly a first step towards a “happier kedro-SQL world”.

To summarize, if those 2 datasets were to be released tomorrow, I would probably make extensive use of them 😃. Since refactoring the catalog is something that will likely take months (years?), it makes sense to provide them as a short-term solution to some problems users are facing with Kedro/SQL interaction right now.
