BigQueryHook `create_empty_dataset` missing `datasetReference`

Apache Airflow version: 2.0.2

Kubernetes version (if you are using kubernetes) (use kubectl version): 1.19

Environment:

  • Cloud provider or hardware configuration: GKE
  • OS (e.g. from /etc/os-release): Debian GNU/Linux 10 (buster)
  • Kernel (e.g. uname -a): x86_64 GNU/Linux
  • Install tools:
  • Others:

What happened:

Using the dataset_reference argument of BigQueryCreateEmptyDatasetOperator to set the default table expiration raises a KeyError:

[2021-05-27 00:03:23,999] {taskinstance.py:1455} ERROR - 'datasetReference'
Traceback (most recent call last):
  File "/opt/bitnami/airflow/venv/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1112, in _run_raw_task
    self._prepare_and_execute_task_with_callbacks(context, task)
  File "/opt/bitnami/airflow/venv/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1285, in _prepare_and_execute_task_with_callbacks
    result = self._execute_task(context, task_copy)
  File "/opt/bitnami/airflow/venv/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1315, in _execute_task
    result = task_copy.execute(context=context)
  File "/opt/bitnami/airflow/venv/lib/python3.8/site-packages/airflow/providers/google/cloud/operators/bigquery.py", line 1419, in execute
    bq_hook.create_empty_dataset(
  File "/opt/bitnami/airflow/venv/lib/python3.8/site-packages/airflow/providers/google/common/hooks/base_google.py", line 425, in inner_wrapper
    return func(self, *args, **kwargs)
  File "/opt/bitnami/airflow/venv/lib/python3.8/site-packages/airflow/providers/google/cloud/hooks/bigquery.py", line 414, in create_empty_dataset
    specified_param = dataset_reference["datasetReference"].get(param)
KeyError: 'datasetReference'

What you expected to happen:

I expected no error. The operator's documentation shows a similar dataset_reference dict, without a datasetReference key:

create_new_dataset = BigQueryCreateEmptyDatasetOperator(
    dataset_id='new-dataset',
    project_id='my-project',
    dataset_reference={"friendlyName": "New Dataset"},
    gcp_conn_id='_my_gcp_conn_',
    task_id='newDatasetCreator',
    dag=dag
)

How to reproduce it:

create_dataset = BigQueryCreateEmptyDatasetOperator(
    task_id='create_dataset',
    project_id=PROJECT,
    dataset_id=DATASET,
    dataset_reference={"defaultTableExpirationMs": str(1000 * 60 * 60 * 24 * 30)},
    dag=dag
)

Anything else we need to know:

The create_empty_dataset method from the BigQueryHook class expects datasetReference to always be a key in the dictionary:

https://github.com/apache/airflow/blob/86768859c689bf02ced96e71996a3a30da1b5888/airflow/providers/google/cloud/hooks/bigquery.py#L442
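The failure pattern can be reproduced without Airflow. The following is a minimal sketch, assuming the hook reads each parameter with the expression shown in the traceback (`dataset_reference["datasetReference"].get(param)`). The function names `lookup_param` and `lookup_param_safe` are hypothetical and are not part of Airflow; the second one only illustrates one possible tolerant lookup, not the fix the maintainers chose.

```python
def lookup_param(dataset_reference, param):
    # Same expression as in the traceback: indexing with [] raises
    # KeyError when the "datasetReference" key is absent.
    return dataset_reference["datasetReference"].get(param)


def lookup_param_safe(dataset_reference, param):
    # Hypothetical tolerant variant: a missing "datasetReference"
    # is treated as an empty dict, so the lookup returns None.
    return dataset_reference.get("datasetReference", {}).get(param)


# The dict from the reproduction above (30 days in milliseconds).
body = {"defaultTableExpirationMs": str(1000 * 60 * 60 * 24 * 30)}

try:
    lookup_param(body, "datasetId")
    raised = False
except KeyError:
    raised = True

safe_value = lookup_param_safe(body, "datasetId")
```

With the dict from the reproduction, the first lookup raises KeyError and the tolerant variant returns None.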

I was able to work around the error by adding an empty datasetReference key to the dict:

create_dataset = BigQueryCreateEmptyDatasetOperator(
    task_id='create_dataset',
    project_id=PROJECT,
    dataset_id=DATASET,
    dataset_reference={"datasetReference": {}, "defaultTableExpirationMs": str(1000 * 60 * 60 * 24 * 30)},
    dag=dag
)
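If several DAGs need the same workaround, it could be wrapped in a small helper. This is only a sketch: `ensure_dataset_reference` is a hypothetical name, not an Airflow function, and it assumes that an empty `datasetReference` dict is enough for the hook to proceed, as in the workaround above.

```python
def ensure_dataset_reference(body):
    # Return a copy of the dict so the caller's input is not mutated.
    patched = dict(body)
    # Add an empty "datasetReference" only when the key is missing;
    # an existing value is left untouched.
    patched.setdefault("datasetReference", {})
    return patched


# 30 days expressed in milliseconds, as in the issue.
thirty_days_ms = str(1000 * 60 * 60 * 24 * 30)

original = {"defaultTableExpirationMs": thirty_days_ms}
patched = ensure_dataset_reference(original)
```

The result could then be passed as `dataset_reference=ensure_dataset_reference({...})` when constructing the operator.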

Issue Analytics

  • State: open
  • Created 2 years ago
  • Reactions: 1
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

rbiegacz commented, May 30, 2022 (1 reaction)

@eladkal - please, assign this item to me - I will try to fix it.

eladkal commented, Sep 10, 2022 (0 reactions)

@g-saxena assigned
