Enhance support for parameterised SQL query datasets
Description
It would be useful to extend the ability of SQLQueryDatasets to make use of parameterised queries that can be given parameters at runtime.
Context
With the current SQLQueryDataset, parameters can be used in some cases. For example, parameters can be passed in from globals.yml:
# globals.yml
foo: bar

# catalog.yml
sql:
  type: pandas.SQLQueryDataset
  sql: "SELECT * FROM table WHERE column = ${foo}"
However, when these values are loaded from a YAML file, the string representation of the corresponding Python object is substituted into the query. This is a problem for lists, as the following would not produce a valid SQL query:
# globals.yml
foo:
  - a
  - b
  - c

# catalog.yml
sql:
  type: pandas.SQLQueryDataset
  sql: "SELECT * FROM table WHERE column in ${foo};" # parsed as "SELECT * FROM table WHERE column in ['a','b','c'];"
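To make the problem concrete, here is a minimal sketch using only the Python standard library. It rests on assumptions: `string.Template` stands in for the `${...}` substitution applied to the config, `sqlite3` stands in for the real database connection, and the table `t` and column `col` are made up for illustration. Interpolating a list inserts its Python repr, whereas generating one placeholder per value and binding the values separately gives a valid query.

```python
import sqlite3
from string import Template

values = ["a", "b", "c"]

# Naive substitution: the list is converted with str(), which yields
# Python list syntax rather than a SQL value list.
naive = Template("SELECT * FROM t WHERE col IN ${foo}").substitute(foo=values)
print(naive)

# Parameterised alternative: one qmark placeholder per value,
# with the values passed to the driver as bound parameters.
placeholders = ", ".join("?" for _ in values)
query = f"SELECT * FROM t WHERE col IN ({placeholders})"
print(query)

# Hypothetical in-memory table, used only to show the query executing.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (col TEXT)")
con.executemany("INSERT INTO t VALUES (?)", [("a",), ("b",), ("z",)])
rows = con.execute(query, values).fetchall()
print(rows)
```

This is roughly the expansion that an in-clause helper has to perform on the user's behalf.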
Possible Implementation
The jinjasql package provides some utilities for parsing templates using jinja syntax.
import jinjasql


def render_template_query(template, **kwargs) -> tuple[str, list]:
    """Renders a query from a sql template.

    Examples
    --------
    Additional keyword arguments are used to fill in parameters

    >>> template = "SELECT * FROM table WHERE column = {{foo}}"
    >>> query, params = render_template_query(template, foo='bar')

    Columns and Tables must be marked as 'sqlsafe' to be parameterised.

    >>> template = "SELECT * FROM {{table | sqlsafe}} WHERE {{col | sqlsafe}} = {{foo}}"
    >>> query, params = render_template_query(template, table='mytable', col='column', foo='bar')

    Collections of values can be used to parameterise in-clauses

    >>> template = "SELECT * FROM table WHERE column in {{values | inclause}}"
    >>> query, params = render_template_query(template, values=['a','b','c'])
    """
    # pyodbc uses qmark syntax; with this style the bind parameters come back as a list
    jinja = jinjasql.JinjaSql(param_style="qmark")
    query, params = jinja.prepare_query(template, kwargs)
    return query, params
This can then be passed into pandas.read_sql_query as follows:
import pandas as pd
con = ...
template = "SELECT * FROM table WHERE column in {{values | inclause}}"
query, params = render_template_query(template, values=['a','b','c'])
pd.read_sql_query(query, con=con, params=params)
From a configuration point of view, it might be useful to add a keyword to SQLQueryDataset to make it explicit that the query is a template rather than a valid SQL string, e.g.
sql:
  type: pandas.SQLQueryDataset
  template: "SELECT * FROM table WHERE column in {{foo | inclause}};"
I haven’t given much thought yet as to how this could take values from runtime parameters. I think it would require some additional validation, e.g. checking that all the parameters in the template have a value.
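As a rough sketch of that validation step, something like the following could report template parameters that have no value. The function names `missing_parameters` and `validate_parameters` are hypothetical, and the regex only handles bare names with optional filters such as `{{foo}}` or `{{foo | inclause}}`. A real implementation would probably lean on Jinja's own parser (e.g. `jinja2.meta.find_undeclared_variables`) rather than a regex.

```python
import re

# Matches the leading variable name in "{{ name }}" or "{{ name | filter }}".
# Simplifying assumption: expressions are bare names, optionally followed by filters.
_PARAM = re.compile(r"\{\{\s*([A-Za-z_]\w*)\s*(?:\|[^}]*)?\}\}")


def missing_parameters(template: str, params: dict) -> set:
    """Return the template parameter names that have no value in `params`."""
    return set(_PARAM.findall(template)) - set(params)


def validate_parameters(template: str, params: dict) -> None:
    """Raise ValueError if any template parameter is missing a value."""
    missing = missing_parameters(template, params)
    if missing:
        raise ValueError(
            f"Missing values for template parameters: {sorted(missing)}"
        )


template = "SELECT * FROM {{table | sqlsafe}} WHERE {{col | sqlsafe}} = {{foo}}"
print(missing_parameters(template, {"table": "mytable", "foo": "bar"}))
```

Running the check before rendering would let the dataset fail with a clear message instead of a driver error at query time.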
Thanks for the tips. I think your suggestion @AntonyMilneQB should work for my particular use-case. I can see the reasoning behind limiting the usage of SQL in Kedro. Perhaps I need to look into dbt and see if that’s more suitable.

Okay great! I’m glad it’s resolved 🙂