Allow specifying the GCP project for `bigquery.jobUser` in `profiles.yml`

Describe the feature
Allow dbt to process BigQuery data in one project using query jobs from another project.

BigQuery context

We have different types of dbt runs in production (hourly runs, daily runs, 10-minute runs) that process the same dataset. Some runs, e.g. the 10-minute runs, are more critical than the others and require a dedicated BQ slot reservation.
To support that, it is a common and necessary practice in BigQuery workload management to process data in one project, `gcp_project_A.prod_dataset`, using BigQuery slot resources from another project, `gcp_project_B`. Concretely, a service account `gcp-service-account-1` would have the `bigquery.dataViewer` role on `gcp_project_A.prod_dataset` and the `bigquery.jobUser` role in `gcp_project_B`.
The Python code below processes `gcp_project_A.prod_dataset` using a query job that runs in `gcp_project_B`:
```python
from google.cloud import bigquery

database = "gcp_project_A"
gcp_job_project = "gcp_project_B"

# The client's project determines where the query job runs (and which
# slot reservation it uses); it is independent of the data's project.
client = bigquery.Client(
    project=gcp_job_project,
    # credentials=creds,
    location="US",
)
sql = (
    "SELECT count(*) AS count "
    "FROM `{}.prod_dataset.table_1`".format(database)
)
query_job = client.query(sql)
query_result = query_job.result(timeout=20)
print(list(query_result))
```
dbt context

Currently, there is no easy way in dbt for `gcp-service-account-1` to process `gcp_project_A.prod_dataset` using a query job from `gcp_project_B`, because `profiles.yml` specifies only one GCP project, which is used both as the project of the dataset and as the project of the BQ job user, as shown below.
```yaml
stage:
  type: bigquery
  project: gcp_project_A
  dataset: prod_dataset
  keyfile: /mnt/gcp-service-account-1.json
  method: service-account
```
In dbt's connection code, that same `database` value (`gcp_project_A`) is also passed as the project of `bigquery.Client`, i.e. the project the query job runs from.
Suggested solution

Add a new optional key to `profiles.yml`, named, for instance, `bq_job_project`. If `bq_job_project` is present, as below, pass it as the project of `bigquery.Client`.
```yaml
stage:
  type: bigquery
  project: gcp_project_A
  dataset: prod_dataset
  bq_job_project: gcp_project_B
  keyfile: /mnt/gcp-service-account-1.json
  method: service-account
```
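A minimal sketch of how the connection code could resolve the job project (the function name and profile shape here are assumptions for illustration, not dbt's actual internals):

```python
def resolve_job_project(profile: dict) -> str:
    """Pick the GCP project that BigQuery query jobs should run in.

    Falls back to the data project when no dedicated job project is set,
    preserving today's behavior for existing profiles.
    """
    return profile.get("bq_job_project", profile["project"])


profile = {"project": "gcp_project_A", "bq_job_project": "gcp_project_B"}
print(resolve_job_project(profile))  # gcp_project_B
```

Because the new key is optional with a fallback, existing profiles would keep working unchanged.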
Describe alternatives you’ve considered

Currently, the workaround is to:
- set `project: gcp_project_B` in `profiles.yml`
- override the `generate_database_name()` macro so that it generates `gcp_project_A`
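The override in the second step might look like the following (a sketch of a standard `generate_database_name` override; the hard-coded project is specific to our setup):

```sql
{% macro generate_database_name(custom_database_name=none, node=none) -%}
    gcp_project_A
{%- endmacro %}
```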
It could work for us in the short run. However, when we later need `generate_database_name()` for its intended purpose, generating a different custom database name for different models in the same dbt run, we would run out of options.
Additional context
This feature is BigQuery-specific; I don’t think it’s relevant to other databases.
Who will this benefit?
Any team or project that needs more advanced BigQuery workload management, e.g. allocating BQ slot reservations to specific dbt ETL loads.
Are you interested in contributing this feature?
Yes, I would be happy to make a PR for this.
Issue Analytics
- Created 3 years ago
- Reactions: 8
- Comments: 6 (3 by maintainers)
Top GitHub Comments
@hui-zheng Sorry I never picked this back up! It looks like someone else had the same idea in https://github.com/dbt-labs/dbt/issues/3708, and contributed the code for it, too.
I’ve become more amenable to this idea over the past several months. There’s a good chance it will be implemented after all.
On a separate note, using a dedicated BQ slot project for running queries is a very BigQuery-specific concept. It makes sense to provide a proper way to specify it in the profile's BigQuery connection, so that we don't have to hijack `target.project` in `profiles.yml` for that purpose.
The purpose of `target.project` should be reserved for defining the default path for data assets, that is, dbt models/tables, rather than for defining the computation resource. Please re-consider the original proposal of adding a `bq_job_project` variable to `profiles.yml`.