Enable passing .sql task results to a following .sql task as parameters
Use case
I work in a business where we have an abundance of data but not necessarily the best data for data science efforts. In many cases we have millions of time series records for feature building but poor records of the target data we’d like to predict. Further complicating efforts are our many different database servers of various types (Oracle, SQL, Teradata, AWS, Azure).
Because of this I spend a lot of time helping our business become ‘data science ready’ by developing data pipelines to build useful datasets. I’m currently using Python to integrate and transform disparate data sources and outputting to a table.
Request
I would like to be able to pass unique values from one .sql task to another as parameters. Something like this:
tasks:
  - name: get-data-1
    source: sql/first_data.sql
    product: '{{data}}/raw/first_data.parquet'
    client: src.clients.src1
    chunksize: null
  - name: get-data-2
    source: sql/second-data.sql
    product: '{{data}}/raw/second-data.parquet'
    client: src.clients.src2
    params:
      startYear: '{{startYear}}'
      endYear: '{{endYear}}'
    upstream: get-data-1
    chunksize: null
second-data.sql example
select *
from schema.table
where column in (upstream[params])
and year(date) between {{startYear}} and {{endYear}}
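Alternatively, could I output the params via Python as an intermediate .yaml file and use them as inputs, like this?
import pandas as pd
import yaml

# Load the product of the first task
df = pd.read_parquet(upstream['first_data.parquet'])
# Convert the NumPy array to a plain list so it serializes as a YAML sequence
vals = df['column'].unique().tolist()
d = {'val': vals}
# The with block closes the file, so no explicit f.close() is needed
with open("sample.yaml", "w") as f:
    yaml.dump(d, f)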
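To illustrate the intermediate-file idea, here is a minimal sketch of the round trip: writing the values to YAML and reading them back as a downstream step might. The values and the temporary path are hypothetical stand-ins, and this only shows the PyYAML side, not how the pipeline framework would pick the file up.

```python
import tempfile
from pathlib import Path

import yaml

# Hypothetical unique values; in the pipeline these would come from the upstream product
params = {"val": ["a", "b", "c"]}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.yaml"
    # Write the values as a YAML mapping with one list
    with open(path, "w") as f:
        yaml.dump(params, f)
    # Read them back as a downstream step would
    with open(path) as f:
        loaded = yaml.safe_load(f)

print(loaded["val"])
```

Because the values are plain Python strings in a list, `yaml.safe_load` can read the file back without any custom tags.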
Issue Analytics
- State:
- Created 2 years ago
- Comments: 17 (9 by maintainers)
Hi @rockraptor5 and @reesehopkins,
Yes, it’s possible to have them as tuples. You could do something like this:
Then in SQL:
Or you may pass a list and have jinja format it:
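The original snippets for this reply are not preserved here. As a rough sketch of the list approach only, the following uses the jinja2 library directly to turn a list into a SQL `in (...)` clause. The values, the year range, and the template text are hypothetical, and this is not necessarily how the maintainer's example or the released feature works.

```python
from jinja2 import Template

# Hypothetical values, standing in for the unique values produced by the upstream task
values = ["a", "b", "c"]

# The loop quotes each value and separates the items with commas to build the IN list
template = Template(
    "select *\n"
    "from schema.table\n"
    "where column in ("
    "{% for v in values %}'{{ v }}'{% if not loop.last %}, {% endif %}{% endfor %}"
    ")\n"
    "and year(date) between {{ startYear }} and {{ endYear }}"
)

query = template.render(values=values, startYear=2015, endYear=2020)
print(query)
```

Rendering values straight into the query text like this is only reasonable when the values come from a trusted source, such as your own upstream query.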
I’ll take a stab at this feature and post an update so you can test it.
Nice! Bear in mind that this is a new feature and hasn’t been released yet, but you can still use it if you install it from git. I need to work on it a bit more; it will be part of the next release. I’ll close this issue when it’s out. Thanks a lot for your feedback!