Support BigQuery Job Tags and Labels

Describe the feature

I would like to be able to control tagging and labeling of BigQuery Jobs as I run dbt on BigQuery.

A similar (but not the same) issue is #1947, for labeling BigQuery Tables and Datasets. This issue focuses on BigQuery Jobs (such as Insert Jobs or Query Jobs).

Describe alternatives you’ve considered

It’s not possible to label or tag jobs after they have started. From the docs:

"You cannot add labels to or update labels on pending, running, or completed jobs."

Additional context

The main reason why one would tag and label their BigQuery Job is to analyze BigQuery spend. For example, if one were able to link a BigQuery Job to a certain Airflow operator run (or similar – in my case a python script run by a cron! 😄) then a real dollar value can be put on running that operator over time.

I think it’s important to give the developer control on what tags and labels can be added, so it supports their data processing setup. And so I think tags and labels should be able to be set at launch-time. (In my case, I run a python script that calls dbt run – I would want my python script to be able to set the BigQuery Job tags and labels, while the Jobs are ultimately launched by dbt run.)

Who will this benefit?

Folks who are responsible for their BigQuery spend should benefit by using relevant Job tags and labels.

Thanks!

Issue Analytics

Created 3 years ago
Reactions: 5
Comments: 12 (8 by maintainers)

Top GitHub Comments

1 reaction

jmcarp commented, Mar 3, 2021

I’m not very familiar with the dbt internals, so it would probably take me some time to figure out, but I’d be happy to give this a try if nobody picks it up first.

1 reaction

jtcohen6 commented, Mar 1, 2021

This isn’t something we’re prioritizing now. FYI #2809 did add invocation_id as a label to all dbt-bigquery jobs, starting in v0.19.0. That invocation_id can be used to query INFORMATION_SCHEMA.JOBS_BY_* and calculate total time/spend per invocation; it can also be used to associate BigQuery query history with dbt run artifacts (docs), namely run_results.json. Those run artifacts will contain lots of useful metadata, including (e.g.) any environment variables prefixed with DBT_ENV_CUSTOM_ENV_.
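To make the invocation_id lookup concrete, here is a minimal sketch of a helper that builds such a query against INFORMATION_SCHEMA.JOBS_BY_PROJECT. The helper name invocation_cost_sql is hypothetical, and the label key dbt_invocation_id is an assumption about how the label is named; check the labels on one of your own jobs before relying on it.

```python
import re


def invocation_cost_sql(region: str = "us") -> str:
    """Build a query that totals bytes billed and slot time for one dbt invocation.

    Assumption: dbt attaches the invocation id under the label key
    'dbt_invocation_id'. Adjust the key if your jobs show something else.
    """
    # Guard the only interpolated value; the invocation id itself is passed
    # as a query parameter (@invocation_id), not formatted into the string.
    if not re.fullmatch(r"[a-z0-9-]+", region):
        raise ValueError(f"unexpected region name: {region!r}")
    return f"""
SELECT
  COUNT(*) AS job_count,
  SUM(total_bytes_billed) AS bytes_billed,
  SUM(total_slot_ms) AS slot_ms
FROM `region-{region}`.INFORMATION_SCHEMA.JOBS_BY_PROJECT AS jobs,
  UNNEST(jobs.labels) AS label
WHERE label.key = 'dbt_invocation_id'
  AND label.value = @invocation_id
""".strip()
```

The returned string can be run with any BigQuery client that supports named query parameters, binding @invocation_id to the value found in run_results.json.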

I agree that dbt should be able to pass more information than just the invocation_id, though I think that’s a strong start. A --job-label flag that allows the user / orchestration tool to set one value for all nodes in an invocation should be straightforward to implement. The more I think about it, though, I find it functionally limiting but also one-off as an implementation, not well integrated with existing dbt constructs that seek to accomplish the same goal.

I do think the best version of this would make available the full query comment context as per-node job labels. That context, available to the query_comment macro, is defined in query_headers.py. I agree with @hui-zheng, it’s quite easy to pass environment variables or --vars into the query-comment config or query_comment macro, so this approach would solve for both use cases we’ve been discussing.
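On the orchestration side, a wrapper script could forward its own context to dbt through --vars, roughly like this. This is only a sketch: build_dbt_command, run_dbt, and the variable names are hypothetical, and how a query_comment macro would then read those vars is not shown here.

```python
import json
import subprocess
from typing import Dict, List


def build_dbt_command(job_context: Dict[str, str]) -> List[str]:
    """Return the argv for `dbt run`, passing job_context as --vars.

    --vars accepts a YAML string; JSON is valid YAML, so json.dumps is enough.
    """
    return ["dbt", "run", "--vars", json.dumps(job_context, sort_keys=True)]


def run_dbt(job_context: Dict[str, str]) -> None:
    """Invoke dbt with the given context; raises CalledProcessError on failure."""
    subprocess.run(build_dbt_command(job_context), check=True)
```

A cron-driven script might call run_dbt({"launched_by": "nightly_cron"}), where launched_by is just an illustrative key.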

The string version of this comment—the default value, the string passed to the config, or the value returned by the custom macro—is available to the connection manager, via set_query_header and _add_query_comment. The execute method already calls _add_query_comment to prepend the comment to SQL before execution:

https://github.com/fishtown-analytics/dbt/blob/344a14416d22f0cfbeb56b9904092c8a4f38b1fc/plugins/bigquery/dbt/adapters/bigquery/connections.py#L333-L336

So here’s what I’m thinking about:

How would we enable dbt-bigquery users to turn this on? I’m thinking an additional option, nested under query_comment in dbt_project.yml, called job_label: true | false.
If query_comment.job_label is turned on, and the query comment config/macro returns a dict / JSON string (such as the advanced usage example in the docs), should dbt try to parse the returned value into a python dict, and pass each key-value pair as a separate label? I think yes; this should even work for the default query comment value.
If query_comment.job_label is turned on, and the query-comment returns an unstructured string, should dbt still try to pass the first 128 bytes (truncated if needed) as the value to a single label called query_comment? I still think yes, but I’m open to your thoughts on this point (and every point above).
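The parsing behaviour proposed in these three points could look roughly like the sketch below. It is not dbt's implementation: comment_to_labels and sanitize_label are hypothetical names, and the 63-character cap and the lowercase/digit/underscore/dash character set reflect my reading of BigQuery's label rules rather than the 128 bytes mentioned above, so treat both as assumptions to verify.

```python
import json
import re
from typing import Dict

# Assumed BigQuery label limits; verify against the current docs.
MAX_LABEL_LENGTH = 63
_INVALID_CHARS = re.compile(r"[^a-z0-9_-]")


def sanitize_label(value: object, max_length: int = MAX_LABEL_LENGTH) -> str:
    """Lowercase, replace disallowed characters with '_', and truncate."""
    cleaned = _INVALID_CHARS.sub("_", str(value).lower())
    return cleaned[:max_length]


def comment_to_labels(comment: str) -> Dict[str, str]:
    """Turn a query comment into job labels.

    A JSON object becomes one label per key-value pair; anything else
    becomes a single truncated label named 'query_comment'.
    """
    try:
        parsed = json.loads(comment)
    except (TypeError, ValueError):
        parsed = None
    if isinstance(parsed, dict):
        return {sanitize_label(k): sanitize_label(v) for k, v in parsed.items()}
    return {"query_comment": sanitize_label(comment)}
```

The sketch does not enforce that keys start with a lowercase letter, and nested values are simply stringified, so a real implementation would need to decide how to handle both.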

Having written all that out, acknowledging that there are a few tricky pieces, I do think the requisite changes would be relatively self-contained in the codebase. Would anyone be interested in giving it a go?
