BigQueryOperator never uses the location parameter
Apache Airflow version: composer-1.10.4-airflow-1.10.6
Kubernetes version (if you are using kubernetes) (use kubectl version):
Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.10-gke.42", GitCommit:"42bef28c2031a74fc68840fce56834ff7ea08518", GitTreeState:"clean", BuildDate:"2020-06-02T16:07:00Z", GoVersion:"go1.12.12b4", Compiler:"gc", Platform:"linux/amd64"}
What happened and what you think went wrong:
BigQueryOperator does not use the location parameter to specify the query job location. Instead, it uses the location that BigQuery determines automatically, as returned in the jobs.insert response.
This happens because of the following code:
jobs = self.service.jobs()
job_data = {'configuration': configuration}

# Send query and wait for reply.
query_reply = jobs \
    .insert(projectId=self.project_id, body=job_data) \
    .execute(num_retries=self.num_retries)
self.running_job_id = query_reply['jobReference']['jobId']
if 'location' in query_reply['jobReference']:
    location = query_reply['jobReference']['location']
else:
    location = self.location
The configuration block does not contain a location, so the jobs.insert call leaves it to BigQuery to detect one. In practice this falls back to US more often than not, causing the job to fail with an error saying the datasets/tables referenced in the query do not exist. Specifying the location argument in the operator, e.g. location='EU', is therefore not obeyed.
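One way the hook could honour the parameter, sketched below on the assumption that the BigQuery jobs.insert request body accepts a jobReference with an explicit location (self.location, self.project_id and configuration are the names used in the snippet above), is to set the location on the job reference before submitting:

jobs = self.service.jobs()
job_data = {'configuration': configuration}
# Pin the job to the requested location instead of letting BigQuery infer one.
if self.location:
    job_data['jobReference'] = {'location': self.location}

query_reply = jobs \
    .insert(projectId=self.project_id, body=job_data) \
    .execute(num_retries=self.num_retries)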
What you expected to happen:
Specifying location as a BigQueryOperator argument leads to execution of the query job in the correct location.
How to reproduce it: Set up a project and dataset in EU containing an example table.
Then, with an initialised local Airflow (airflow initdb) that has been supplied with GCP/BigQuery default connection details, you may run the following code:
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.models import TaskInstance

dag = DAG(dag_id="anydag", start_date=datetime.now())
task = BigQueryOperator(
    dag=dag,
    task_id="query_task",
    name="query_task",
    write_disposition="WRITE_TRUNCATE",
    use_legacy_sql=False,
    destination_dataset_table="example_project.example_dataset.example_table",
    sql="select * from `example_project.example_dataset.example_input_table`",
    location="US",
)
ti = TaskInstance(task=task, execution_date=datetime.now())
task.execute(ti.get_template_context())
The location parameter will probably not be respected; instead, your job will execute in EU. Occasionally, regardless of the location specified, your job will execute in US. This is difficult to reproduce reliably, as the behaviour is flaky and depends on which location the BigQuery service itself decides the query should run in.
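Until the operator honours the parameter, a possible workaround is to submit the query yourself and pass the location to the client directly. The sketch below assumes the google-cloud-bigquery client library is installed in the Airflow environment; the names run_query_in_eu and query_task_workaround are illustrative:

from airflow.operators.python_operator import PythonOperator
from google.cloud import bigquery

def run_query_in_eu():
    # The client accepts an explicit location, so the job runs where the data lives.
    client = bigquery.Client(project="example_project")
    job_config = bigquery.QueryJobConfig(
        destination=bigquery.TableReference.from_string(
            "example_project.example_dataset.example_table"
        ),
        write_disposition="WRITE_TRUNCATE",
    )
    query_job = client.query(
        "select * from `example_project.example_dataset.example_input_table`",
        job_config=job_config,
        location="EU",
    )
    query_job.result()  # block until the job finishes

query_task = PythonOperator(
    dag=dag,
    task_id="query_task_workaround",
    python_callable=run_query_in_eu,
)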
@muscovitebob please upgrade Cloud Composer to the latest version.
Old environments do not support these packages.
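For context, newer Airflow releases (and the Google backport providers for 1.10.x) ship BigQueryInsertJobOperator, which accepts a location argument directly. A minimal sketch, assuming the apache-airflow-backport-providers-google package (or the Airflow 2 Google provider) is installed:

from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

query_task = BigQueryInsertJobOperator(
    dag=dag,
    task_id="query_task",
    configuration={
        "query": {
            "query": "select * from `example_project.example_dataset.example_input_table`",
            "useLegacySql": False,
            "destinationTable": {
                "projectId": "example_project",
                "datasetId": "example_dataset",
                "tableId": "example_table",
            },
            "writeDisposition": "WRITE_TRUNCATE",
        }
    },
    location="EU",  # honoured by the newer operator
)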
Closing as the reported problem itself seems to be solved.