Allow specifying the GCP project for `bigquery.jobUser` in `profiles.yml`

Describe the feature
Allow dbt to process BigQuery data in one project using query jobs from another project.

BigQuery context

We have different types of dbt runs in production (hourly runs, daily runs, 10-minute runs) that process the same dataset. Some runs, e.g. the 10-minute runs, are more critical than the others and require a dedicated BQ slot reservation.
To support that, it is a common and necessary practice in BigQuery workload management to process data in one project, `gcp_project_A.prod_dataset`, using BigQuery slot resources from another project, `gcp_project_B`. Concretely, a service account `gcp-service-account-1` would have the `bigquery.dataViewer` role on `gcp_project_A.prod_dataset` and the `bigquery.jobUser` role in `gcp_project_B`.
The Python code below processes `gcp_project_A.prod_dataset` using a query job that runs in `gcp_project_B`:
```python
from google.cloud import bigquery

database = "gcp_project_A"
gcp_job_project = "gcp_project_B"

# The client's project determines where the query job runs (and which
# slot reservation it uses); it is independent of the data's project.
client = bigquery.Client(
    project=gcp_job_project,
    # credentials=creds,
    location="US",
)
sql = (
    "SELECT count(*) AS count "
    "FROM `{}.prod_dataset.table_1`".format(database)
)
query_job = client.query(sql)
query_result = query_job.result(timeout=20)
print(list(query_result))
```
dbt context

Currently, there is no easy way in dbt for `gcp-service-account-1` to process `gcp_project_A.prod_dataset` using a query job from `gcp_project_B`, because `profiles.yml` specifies only one GCP project, which is used both as the project of the dataset and as the project of the BQ job user, as shown below.
```yaml
stage:
  type: bigquery
  project: gcp_project_A
  dataset: prod_dataset
  keyfile: /mnt/gcp-service-account-1.json
  method: service-account
```
In dbt's connection code, that same `database` value (`gcp_project_A`) is also passed as the project of `bigquery.Client`, i.e. the project the query job runs from.
Suggested solution

Add a new optional key to `profiles.yml`, named, for instance, `bq_job_project`. If `bq_job_project` is present, as below, pass it as the project of `bigquery.Client`.
```yaml
stage:
  type: bigquery
  project: gcp_project_A
  dataset: prod_dataset
  bq_job_project: gcp_project_B
  keyfile: /mnt/gcp-service-account-1.json
  method: service-account
```
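A minimal sketch of how the connection code could resolve the job project (the function name and profile shape here are assumptions for illustration, not dbt's actual internals):

```python
def resolve_job_project(profile: dict) -> str:
    """Pick the GCP project that BigQuery query jobs should run in.

    Falls back to the data project when no dedicated job project is set,
    preserving today's behavior for existing profiles.
    """
    return profile.get("bq_job_project", profile["project"])


profile = {"project": "gcp_project_A", "bq_job_project": "gcp_project_B"}
print(resolve_job_project(profile))  # gcp_project_B
```

Because the new key is optional with a fallback, existing profiles would keep working unchanged.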
Describe alternatives you’ve considered

Currently, the workaround is to:
- set `project: gcp_project_B` in `profiles.yml`
- override the `generate_database_name()` macro so that it generates `gcp_project_A`
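The override in the second step might look like the following (a sketch of a standard `generate_database_name` override; the hard-coded project is specific to our setup):

```sql
{% macro generate_database_name(custom_database_name=none, node=none) -%}
    gcp_project_A
{%- endmacro %}
```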
It could work for us in the short run. However, when we later need `generate_database_name()` for its intended purpose, generating a different custom database name for different models in the same dbt run, we would run out of options.
Additional context
This feature is BigQuery-specific; I don’t think it’s relevant to other databases.
Who will this benefit?
Any team or project that needs more advanced BigQuery workload management, e.g. allocating BQ slot reservations to specific dbt ETL loads.
Are you interested in contributing this feature?
Yes, I would be happy to make a PR for this.
Issue Analytics
- Created 3 years ago
- Reactions: 8
- Comments: 6 (3 by maintainers)
Top GitHub Comments
@hui-zheng Sorry I never picked this back up! It looks like someone else had the same idea in https://github.com/dbt-labs/dbt/issues/3708, and contributed the code for it, too.
I’ve become more amenable to this idea over the past several months. There’s a good chance it will be implemented after all.
On a separate note, using a dedicated BQ slot project for running queries is a very BigQuery-specific concept. It makes sense to provide a proper way to specify it in the profile's BigQuery connection, so that we don't have to hijack `target.project` in `profiles.yml` for that purpose.
The purpose of `target.project` should be reserved for defining the default path for data assets, that is, dbt models/tables, rather than for defining the computation resource. Please re-consider the original proposal of adding a `bq_job_project` variable to `profiles.yml`.