BigqueryToGCS Operator Failing
Apache Airflow Provider(s)
google
Versions of Apache Airflow Providers
apache-airflow-providers-google==2022.5.18+composer
Apache Airflow version
2.2.3
Operating System
Managed
Deployment
Composer
Deployment details
Environment Configuration:
- Composer version: composer-2.0.14
- Airflow version: airflow-2.2.3
- Image version: composer-2.0.14-airflow-2.2.3
Workload configuration:
- Scheduler - 2 vCPUs, 4 GB memory, 5 GB storage
- Number of schedulers - 1
- Web server - 1 vCPU, 4 GB memory, 5 GB storage
- Worker - 2 vCPUs, 7.5 GB memory, 10 GB storage
- Number of workers - Autoscaling between 1 and 3 workers
Core Infrastructure:
- Environment Size: Medium
Configuration Overrides:
No Airflow configuration overrides
What happened
Previously we were on apache-airflow-providers-google==6.4.0 (composer - 2.0.8 | airflow - 2.2.3), where we used the BigQueryToGCSOperator in our DAGs as follows:
from airflow.providers.google.cloud.transfers import bigquery_to_gcs
###
###
###
###
bq_to_gcs_task = bigquery_to_gcs.BigQueryToGCSOperator(
task_id='BQ_TO_GCS',
source_project_dataset_table=bq_table,
destination_cloud_storage_uris=f"gs://{bucket_name}//{folder_name}//file*.csv",
export_format='CSV'
)
task_1 >> bq_to_gcs_task >> .. >> ..
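As an aside, the f-string in the snippet above inserts double slashes ("//") between the bucket and folder segments, which GCS interprets as empty path components (the same "//" shows up in the task log). A small sketch, using placeholder bucket/folder names, that builds the URI with single separators:

```python
# Placeholder names; substitute the real bucket and folder.
bucket_name = "my-bucket"
folder_name = "exports"

# One "/" per path segment avoids empty path components in the object name.
uri = "gs://{}/{}/{}".format(bucket_name, folder_name, "file*.csv")
print(uri)  # gs://my-bucket/exports/file*.csv
```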
This was working until we switched to apache-airflow-providers-google==2022.5.18+composer (composer - 2.0.14 | airflow - 2.2.3). Now every time the operator is executed, the task goes to the failed state in Airflow. However, I observed that the CSV files are created as expected by the operator. The logs state that the operator is not able to find the BigQuery job it executed, and hence it fails. Task logs are as follows:
[2022-06-05, 06:23:02 UTC] {bigquery_to_gcs.py:120} INFO - Executing extract of <project_id>.<dataset_name>.<table_name> into: gs://<bucket_name>//<folder_name>//file*.csv
[2022-06-05, 06:23:03 UTC] {warnings.py:109} WARNING - /opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/hooks/bigquery.py:1942: DeprecationWarning: This method is deprecated. Please use `BigQueryHook.insert_job` method.
warnings.warn(
[2022-06-05, 06:23:05 UTC] {taskinstance.py:1702} ERROR - Task failed with exception
Traceback (most recent call last):
File "/opt/python3.8/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1330, in _run_raw_task
self._execute_task_with_callbacks(context)
File "/opt/python3.8/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1457, in _execute_task_with_callbacks
result = self._execute_task(context, self.task)
File "/opt/python3.8/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1513, in _execute_task
result = execute_callable(context=context)
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/transfers/bigquery_to_gcs.py", line 141, in execute
job = hook.get_job(job_id=job_id).to_api_repr()
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/common/hooks/base_google.py", line 439, in inner_wrapper
return func(self, *args, **kwargs)
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/hooks/bigquery.py", line 1492, in get_job
job = client.get_job(job_id=job_id, project=project_id, location=location)
File "/opt/python3.8/lib/python3.8/site-packages/google/cloud/bigquery/client.py", line 2066, in get_job
resource = self._call_api(
File "/opt/python3.8/lib/python3.8/site-packages/google/cloud/bigquery/client.py", line 782, in _call_api
return call()
File "/opt/python3.8/lib/python3.8/site-packages/google/api_core/retry.py", line 283, in retry_wrapped_func
return retry_target(
File "/opt/python3.8/lib/python3.8/site-packages/google/api_core/retry.py", line 190, in retry_target
return target()
File "/opt/python3.8/lib/python3.8/site-packages/google/cloud/_http/__init__.py", line 494, in api_request
raise exceptions.from_http_response(response)
google.api_core.exceptions.NotFound: 404 GET https://bigquery.googleapis.com/bigquery/v2/projects/<project_id>/jobs/<job_id>?projection=full&prettyPrint=false: Not found: Job <project_id>:<job_id>
[2022-06-05, 06:23:05 UTC] {standard_task_runner.py:89} ERROR - Failed to execute job 50257 for task BQ_TO_GCS
Traceback (most recent call last):
File "/opt/python3.8/lib/python3.8/site-packages/airflow/task/task_runner/standard_task_runner.py", line 85, in _start_by_fork
args.func(args, dag=self.dag)
File "/opt/python3.8/lib/python3.8/site-packages/airflow/cli/cli_parser.py", line 48, in command
return func(*args, **kwargs)
File "/opt/python3.8/lib/python3.8/site-packages/airflow/utils/cli.py", line 94, in wrapper
return f(*args, **kwargs)
File "/opt/python3.8/lib/python3.8/site-packages/airflow/cli/commands/task_command.py", line 304, in task_run
_run_task_by_selected_method(args, dag, ti)
File "/opt/python3.8/lib/python3.8/site-packages/airflow/cli/commands/task_command.py", line 109, in _run_task_by_selected_method
_run_raw_task(args, ti)
File "/opt/python3.8/lib/python3.8/site-packages/airflow/cli/commands/task_command.py", line 182, in _run_raw_task
ti._run_raw_task(
File "/opt/python3.8/lib/python3.8/site-packages/airflow/utils/session.py", line 70, in wrapper
return func(*args, session=session, **kwargs)
File "/opt/python3.8/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1330, in _run_raw_task
self._execute_task_with_callbacks(context)
File "/opt/python3.8/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1457, in _execute_task_with_callbacks
result = self._execute_task(context, self.task)
File "/opt/python3.8/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1513, in _execute_task
result = execute_callable(context=context)
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/transfers/bigquery_to_gcs.py", line 141, in execute
job = hook.get_job(job_id=job_id).to_api_repr()
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/common/hooks/base_google.py", line 439, in inner_wrapper
return func(self, *args, **kwargs)
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/hooks/bigquery.py", line 1492, in get_job
job = client.get_job(job_id=job_id, project=project_id, location=location)
File "/opt/python3.8/lib/python3.8/site-packages/google/cloud/bigquery/client.py", line 2066, in get_job
resource = self._call_api(
File "/opt/python3.8/lib/python3.8/site-packages/google/cloud/bigquery/client.py", line 782, in _call_api
return call()
File "/opt/python3.8/lib/python3.8/site-packages/google/api_core/retry.py", line 283, in retry_wrapped_func
return retry_target(
File "/opt/python3.8/lib/python3.8/site-packages/google/api_core/retry.py", line 190, in retry_target
return target()
File "/opt/python3.8/lib/python3.8/site-packages/google/cloud/_http/__init__.py", line 494, in api_request
raise exceptions.from_http_response(response)
google.api_core.exceptions.NotFound: 404 GET https://bigquery.googleapis.com/bigquery/v2/projects/<project_id>/jobs/<job_id>?projection=full&prettyPrint=false: Not found: Job <project_id>:<job_id>
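Until the linked fix lands, one possible workaround is to submit the extract job directly with BigQueryInsertJobOperator, passing the same kind of BigQuery Jobs API "extract" configuration that the transfer operator builds internally. A sketch of that configuration dict (all project/dataset/table/bucket names below are placeholders):

```python
# Hypothetical BigQuery Jobs API "extract" configuration; this dict would be
# passed as `configuration=` to BigQueryInsertJobOperator. All identifiers
# are placeholders.
extract_config = {
    "extract": {
        "sourceTable": {
            "projectId": "my-project",
            "datasetId": "my_dataset",
            "tableId": "my_table",
        },
        # destinationUris must be a list; wildcards shard large exports.
        "destinationUris": ["gs://my-bucket/exports/file*.csv"],
        "destinationFormat": "CSV",
        "printHeader": True,
    }
}
print(extract_config["extract"]["destinationFormat"])  # CSV
```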
I am also not able to install the package apache-airflow-providers-google==2022.5.18+composer locally, as pip is not able to locate it, nor am I able to see it in the releases.
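This is expected behaviour for version strings of that shape: under PEP 440, everything after the "+" is a local version identifier, and PyPI does not host builds with local versions, so only the public upstream releases are installable. A minimal sketch of how such a version string splits:

```python
# PEP 440: the part after "+" is a "local version identifier". Indexes like
# PyPI reject distributions with local versions, which is why pip cannot
# resolve the Composer-internal "+composer" build.
spec = "2022.5.18+composer"
public, _, local = spec.partition("+")
print(public)  # 2022.5.18
print(local)   # composer
```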
What you think should happen instead
The task should complete with a success status, given that the query is valid and the export files are in fact created.
How to reproduce
No response
Anything else
No response
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project’s Code of Conduct
Issue Analytics
- Created: a year ago
- Comments: 10 (4 by maintainers)
Closing with #24461
Related: https://github.com/apache/airflow/pull/24461