
BigqueryToGCS Operator Failing


Apache Airflow Provider(s)

google

Versions of Apache Airflow Providers

apache-airflow-providers-google==2022.5.18+composer

Apache Airflow version

2.2.3

Operating System

Managed

Deployment

Composer

Deployment details

Environment Configuration:

  • Composer version: composer-2.0.14
  • Airflow version: airflow-2.2.3
  • Image version: composer-2.0.14-airflow-2.2.3

Workload configuration:

  • Scheduler - 2 vCPUs, 4 GB memory, 5 GB storage
  • Number of schedulers - 1
  • Web server - 1 vCPU, 4 GB memory, 5 GB storage
  • Worker - 2 vCPUs, 7.5 GB memory, 10 GB storage
  • Number of workers - Autoscaling between 1 and 3 workers

Core Infrastructure:

  • Environment Size: Medium

Configuration Overrides:

No Airflow configuration overrides

What happened

Previously we were on apache-airflow-providers-google==6.4.0 (composer-2.0.8 | airflow-2.2.3), where we used the BigQueryToGCSOperator in our DAGs as follows:

from airflow.providers.google.cloud.transfers import bigquery_to_gcs

###
###
###
###

bq_to_gcs_task = bigquery_to_gcs.BigQueryToGCSOperator(
  task_id='BQ_TO_GCS',
  source_project_dataset_table=bq_table,
  destination_cloud_storage_uris=f"gs://{bucket_name}//{folder_name}//file*.csv",
  export_format='CSV'
)

task_1 >> bq_to_gcs_task  >> .. >> ..

This worked until we switched to apache-airflow-providers-google==2022.5.18+composer (composer-2.0.14 | airflow-2.2.3). Now every time the operator executes, the task is marked as failed in Airflow, even though I observed that the CSV files are created as expected. The logs show that the operator cannot find the BigQuery job it just executed, and therefore fails. Task logs are as follows:

[2022-06-05, 06:23:02 UTC] {bigquery_to_gcs.py:120} INFO - Executing extract of <project_id>.<dataset_name>.<table_name> into: gs://<bucket_name>//<folder_name>//file*.csv
[2022-06-05, 06:23:03 UTC] {warnings.py:109} WARNING - /opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/hooks/bigquery.py:1942: DeprecationWarning: This method is deprecated. Please use `BigQueryHook.insert_job` method.
  warnings.warn(

[2022-06-05, 06:23:05 UTC] {taskinstance.py:1702} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1330, in _run_raw_task
    self._execute_task_with_callbacks(context)
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1457, in _execute_task_with_callbacks
    result = self._execute_task(context, self.task)
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1513, in _execute_task
    result = execute_callable(context=context)
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/transfers/bigquery_to_gcs.py", line 141, in execute
    job = hook.get_job(job_id=job_id).to_api_repr()
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/common/hooks/base_google.py", line 439, in inner_wrapper
    return func(self, *args, **kwargs)
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/hooks/bigquery.py", line 1492, in get_job
    job = client.get_job(job_id=job_id, project=project_id, location=location)
  File "/opt/python3.8/lib/python3.8/site-packages/google/cloud/bigquery/client.py", line 2066, in get_job
    resource = self._call_api(
  File "/opt/python3.8/lib/python3.8/site-packages/google/cloud/bigquery/client.py", line 782, in _call_api
    return call()
  File "/opt/python3.8/lib/python3.8/site-packages/google/api_core/retry.py", line 283, in retry_wrapped_func
    return retry_target(
  File "/opt/python3.8/lib/python3.8/site-packages/google/api_core/retry.py", line 190, in retry_target
    return target()
  File "/opt/python3.8/lib/python3.8/site-packages/google/cloud/_http/__init__.py", line 494, in api_request
    raise exceptions.from_http_response(response)
google.api_core.exceptions.NotFound: 404 GET https://bigquery.googleapis.com/bigquery/v2/projects/<project_id>/jobs/<job_id>?projection=full&prettyPrint=false: Not found: Job <project_id>:<job_id>
[2022-06-05, 06:23:05 UTC] {standard_task_runner.py:89} ERROR - Failed to execute job 50257 for task BQ_TO_GCS
Traceback (most recent call last):
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/task/task_runner/standard_task_runner.py", line 85, in _start_by_fork
    args.func(args, dag=self.dag)
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/cli/cli_parser.py", line 48, in command
    return func(*args, **kwargs)
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/utils/cli.py", line 94, in wrapper
    return f(*args, **kwargs)
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/cli/commands/task_command.py", line 304, in task_run
    _run_task_by_selected_method(args, dag, ti)
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/cli/commands/task_command.py", line 109, in _run_task_by_selected_method
    _run_raw_task(args, ti)
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/cli/commands/task_command.py", line 182, in _run_raw_task
    ti._run_raw_task(
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/utils/session.py", line 70, in wrapper
    return func(*args, session=session, **kwargs)
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1330, in _run_raw_task
    self._execute_task_with_callbacks(context)
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1457, in _execute_task_with_callbacks
    result = self._execute_task(context, self.task)
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1513, in _execute_task
    result = execute_callable(context=context)
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/transfers/bigquery_to_gcs.py", line 141, in execute
    job = hook.get_job(job_id=job_id).to_api_repr()
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/common/hooks/base_google.py", line 439, in inner_wrapper
    return func(self, *args, **kwargs)
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/hooks/bigquery.py", line 1492, in get_job
    job = client.get_job(job_id=job_id, project=project_id, location=location)
  File "/opt/python3.8/lib/python3.8/site-packages/google/cloud/bigquery/client.py", line 2066, in get_job
    resource = self._call_api(
  File "/opt/python3.8/lib/python3.8/site-packages/google/cloud/bigquery/client.py", line 782, in _call_api
    return call()
  File "/opt/python3.8/lib/python3.8/site-packages/google/api_core/retry.py", line 283, in retry_wrapped_func
    return retry_target(
  File "/opt/python3.8/lib/python3.8/site-packages/google/api_core/retry.py", line 190, in retry_target
    return target()
  File "/opt/python3.8/lib/python3.8/site-packages/google/cloud/_http/__init__.py", line 494, in api_request
    raise exceptions.from_http_response(response)
google.api_core.exceptions.NotFound: 404 GET https://bigquery.googleapis.com/bigquery/v2/projects/<project_id>/jobs/<job_id>?projection=full&prettyPrint=false: Not found: Job <project_id>:<job_id>
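One plausible reading of the 404 above (not confirmed from this traceback alone) is that `get_job` is being called without the location the extract job actually ran in: BigQuery scopes job lookups by location, and a fully qualified job reference serializes as `<project>:<location>.<job_id>`. The helper below is a hypothetical sketch (the function name and example ids are mine, not from the provider code) that illustrates how the location component fits into a job id:

```python
# Hypothetical sketch: BigQuery job lookups are scoped by location.
# A fully qualified job id looks like "<project>:<location>.<job_id>";
# fetching the same job id against the wrong (or no) location can
# yield "404 Not found: Job <project>:<job_id>" like the log above.

def parse_job_reference(full_job_id: str):
    """Split a fully qualified BigQuery job id into (project, location, job_id)."""
    project, rest = full_job_id.split(":", 1)
    if "." in rest:
        location, job_id = rest.split(".", 1)
    else:
        # No location component - the lookup falls back to a default location.
        location, job_id = None, rest
    return project, location, job_id

print(parse_job_reference("my-project:asia-south1.airflow_1654410182"))
# -> ('my-project', 'asia-south1', 'airflow_1654410182')
```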

I am also unable to install the package apache-airflow-providers-google==2022.5.18+composer locally, as pip cannot locate it, nor does it appear in the provider's releases.
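As a stopgap while the operator is broken, one possible workaround (a sketch I have not verified against this Composer image) is to submit the extract job directly with `BigQueryInsertJobOperator`, so task success does not depend on the separate `get_job` lookup. `build_extract_config` is a hypothetical helper; `bq_table`, `bucket_name`, and `folder_name` are the same variables as in the original DAG:

```python
# Sketch of a workaround: run the BQ-to-GCS export as a plain extract job
# via BigQueryInsertJobOperator instead of BigQueryToGCSOperator.

def build_extract_config(source_table: str, destination_uri: str) -> dict:
    """Build a BigQuery extract-job configuration from a
    fully qualified table name like 'project.dataset.table'."""
    project, dataset, table = source_table.split(".")
    return {
        "extract": {
            "sourceTable": {
                "projectId": project,
                "datasetId": dataset,
                "tableId": table,
            },
            "destinationUris": [destination_uri],
            "destinationFormat": "CSV",
        }
    }

# In the DAG (assumes bq_table is 'project.dataset.table'):
# from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
#
# bq_to_gcs_task = BigQueryInsertJobOperator(
#     task_id="BQ_TO_GCS",
#     configuration=build_extract_config(
#         bq_table, f"gs://{bucket_name}/{folder_name}/file*.csv"
#     ),
# )
```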

What you think should happen instead

The task should complete with a success status, since the extract query is valid, executes, and produces the expected files.

How to reproduce

No response

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 10 (4 by maintainers)

Top GitHub Comments

2 reactions
potiuk commented, Jun 19, 2022

Closing with #24461

1 reaction
lihan commented, Jun 15, 2022
