
BigQueryHook refactor + deterministic BQ Job ID

See original GitHub issue

Description

Looking at the code, it seems like a lot of the logic in the BQ Hook is already implemented in the google API python library. This includes job polling, a nicer way to use job config and, of course, also the validations that we now do manually. It would be nice to make use of these and simplify the code.
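
For context, here is a minimal sketch of what the client library already handles natively: job configuration objects, job submission, and result polling (the project name below is a placeholder):

    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")  # placeholder project id
    job_config = bigquery.QueryJobConfig(use_legacy_sql=False)
    job = client.query("SELECT 1 AS x", job_config=job_config)
    rows = job.result()  # blocks, polling until the job completes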

My idea is then to refactor the run_<job> methods to take the google job config and a deterministic job id. This would help in case a pod dies for any reason: we could restart polling for the async job that was previously started (I apologize for the crappy explanation).

See my hacky spike below:

This is the job id definition, for reference:

    job_id = re.sub(r"[^0-9a-zA-Z_\-]+", "-", f"{self.dag_id}_{self.task_id}_{context['execution_date'].isoformat()}__try_0")
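
With some made-up dag and task values, that definition yields ids like the following (the regex replaces the colons in the ISO timestamp with dashes):

    import re
    from datetime import datetime

    # Illustrative values only.
    dag_id, task_id = "my_dag", "run_bq_query"
    execution_date = datetime(2020, 5, 27, 12, 0)

    job_id = re.sub(
        r"[^0-9a-zA-Z_\-]+", "-",
        f"{dag_id}_{task_id}_{execution_date.isoformat()}__try_0",
    )
    print(job_id)  # my_dag_run_bq_query_2020-05-27T12-00-00__try_0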

and here is roughly how run_query would work:

from typing import Optional

from google.api_core.exceptions import NotFound
from google.cloud.bigquery import QueryJob, QueryJobConfig, TableReference


def run_query(self, job_id: str, job_config: QueryJobConfig, sql: str, destination_dataset_table: Optional[str] = None):
    def _recurse(job_id: str):
        # BQ job ids are unique and can't be re-used, so bump the "__try_" suffix.
        base, try_num = job_id.split("__try_")
        new_job_id = f"{base}__try_{int(try_num) + 1}"
        return _run_query(new_job_id, job_config, sql)

    def _run_query(job_id: str, job_config: QueryJobConfig, sql: str):
        if not self.project_id:
            raise ValueError("The project_id should be set")

        if destination_dataset_table is not None:
            job_config.destination = TableReference.from_string(destination_dataset_table, self.project_id)

        try:
            # Look for a job with this id first, e.g. one started before a pod died.
            job: QueryJob = self.client.get_job(job_id, project=self.project_id)
            if job.state == "RUNNING":
                if job.query != sql:
                    job.cancel()
                    self.log.info(f"Job {job_id} found, but sql is different. "
                                  f"Cancelling the current job and starting a new one")
                    return _recurse(job_id)
                self.log.info(f"Job {job_id} still running, re-starting to poll.")
                return job.result()
            else:
                self.log.info(f"Job {job_id} already executed once. Restarting.")
                return _recurse(job_id)
        except NotFound:
            self.log.info(f"Job {job_id} not found, starting a new job.")

        job: QueryJob = self.client.query(sql, job_config=job_config, job_id=job_id, project=self.project_id)
        self.log.info(f"Running job {job_id}...")
        return job.result()

    return _run_query(job_id, job_config, sql)

The encoded __try_<try_num> is not the Airflow try number but a secondary try counter for when the task is cleared, since BQ Job IDs are unique keys and can't be re-used.
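
A minimal, self-contained sketch of that secondary-try bump (the id is illustrative):

    job_id = "my_dag_run_bq_query_2020-05-27T12-00-00__try_0"
    base, try_num = job_id.split("__try_")
    next_job_id = f"{base}__try_{int(try_num) + 1}"
    print(next_job_id)  # my_dag_run_bq_query_2020-05-27T12-00-00__try_1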

Use case / motivation

Trying to use the functionality in the google cloud library rather than re-implementing it ourselves. This would also allow us to pass through a deterministic Job ID, useful for picking up jobs which are still running in case a pod dies.

Related Issues

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments:14 (13 by maintainers)

Top GitHub Comments

3 reactions
turbaszek commented, May 27, 2020

Summoning @edejong to hear his opinion 😃

1 reaction
albertocalderari commented, Jun 24, 2020

@turbaszek yeah sort of, it's not as simple. I'd really rather have a quick call, are you on Airflow's Slack?


