Enable passing .sql task results to a following .sql task as parameters

Documentation page: community/index

Use case

I work in a business where we have an abundance of data but not necessarily the best data for data science efforts. In many cases we have millions of time series records for feature building but poor records of the target data we’d like to predict. Further complicating efforts are our many different database servers of various types (Oracle, SQL, Teradata, AWS, Azure).

Because of this, I spend a lot of time helping our business become ‘data science ready’ by developing data pipelines to build useful datasets. I’m currently using Python to integrate and transform disparate data sources and write the output to tables.

Request

I would like to be able to pass unique values from one .sql task to another as parameters. Something like this:

tasks:
  - name: get-data-1
    source: sql/first_data.sql
    product: '{{data}}/raw/first_data.parquet'
    client: src.clients.src1
    chunksize: null
  
  - name: get-data-2
    source: sql/second-data.sql
    product: '{{data}}/raw/second-data.parquet'
    client: src.clients.src2
    params:
      startYear: '{{startYear}}'
      endYear: '{{endYear}}'      
    upstream: get-data-1
    chunksize: null

second-data.sql example

select *
from schema.table
where column in (upstream[params])
and year(date) between {{startYear}} and {{endYear}}

Alternatively, could I output the params via python as an intermediate .yaml and use as inputs like this?

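import pandas as pd
import yaml

# Read the upstream product and collect the unique values of the column
df = pd.read_parquet(upstream['first_data.parquet'])
# Convert to a plain list so YAML stores ordinary scalars, not NumPy objects
vals = df['column'].unique().tolist()

d = {'val': vals}

# The with block closes the file on exit, so no explicit close() is needed
with open("sample.yaml", "w") as f:
    yaml.dump(d, f)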

Issue Analytics

Created 2 years ago
Comments: 17 (9 by maintainers)

Top GitHub Comments

2 reactions

edublancas commented, Jan 7, 2022

Hi @rockraptor5 and @reesehopkins,

Yes, it’s possible to have them as tuples. You could do something like this:

import pandas as pd

def my_param(upstream):
    # Unique values of column x, as a tuple of plain Python scalars
    return tuple(pd.read_parquet(upstream["first"]).x.unique().tolist())

Then in SQL:

SELECT * FROM TABLE WHERE x in {{my_param}}

Or you may pass a list and have jinja format it:

SELECT * FROM TABLE WHERE x in ( {{my_param | join(', ') }} )
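One thing to watch with either form is how the values end up as text inside the query. Here is a minimal sketch, assuming the placeholder is filled with the plain string form of the value. `sql_in_list` is a hypothetical helper written for illustration, not part of Ploomber or Jinja. It shows that a one-element tuple keeps Python's trailing comma, and one way to build the list body yourself with string values quoted:

```python
def sql_in_list(values):
    """Format unique values as the body of a SQL IN (...) clause.

    Hypothetical helper for illustration only.
    """
    unique = sorted(set(values))
    if not unique:
        raise ValueError("an empty IN () list is not valid SQL")
    parts = []
    for v in unique:
        if isinstance(v, str):
            # Double any single quote so the SQL literal stays well formed
            parts.append("'" + v.replace("'", "''") + "'")
        else:
            parts.append(str(v))
    return ", ".join(parts)


# A one-element tuple keeps its trailing comma when converted to text
print(str((2020,)))  # (2020,)

# Building the list body explicitly avoids that
print(sql_in_list([2021, 2020, 2021]))  # 2020, 2021
print(sql_in_list(["a", "o'brien"]))  # 'a', 'o''brien'
```

The returned string would go between the parentheses of `x in ( ... )`. For values that come from untrusted input, bound query parameters are the safer choice than string formatting.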

I’ll take a stab at this feature and post an update so you can test it.

1 reaction

edublancas commented, Jan 25, 2022

Nice! Bear in mind that this is a new feature and hasn’t been released yet, but you can still use it if you install it from git. I need to work on it a bit more; it will be part of the next release. I’ll close this issue when it’s out. Thanks a lot for your feedback!