
BigQueryHook refactor + deterministic BQ Job ID

See original GitHub issue

Description

Looking at the code, it seems like a lot of the logic in the BQ Hook is already implemented in the google API python library. This includes job polling, a nicer way to use job config and, of course, also the validations that we now do manually. It would be nice to make use of these and simplify the code.
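
For context, here is a minimal sketch of what the client library already handles natively: job configuration objects, job submission, and result polling (the project name below is a placeholder):

    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")  # placeholder project id
    job_config = bigquery.QueryJobConfig(use_legacy_sql=False)
    job = client.query("SELECT 1 AS x", job_config=job_config)
    rows = job.result()  # blocks, polling until the job completes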

My idea is then to refactor the run_<job> methods to take the google job config and a deterministic job id. This would help in case a pod dies for any reason: we could restart polling for the async job that was previously started (I apologize for the crappy explanation).

See my hacky spike below:

This is the job id definition, for reference:

    job_id = re.sub(r"[^0-9a-zA-Z_\-]+", "-", f"{self.dag_id}_{self.task_id}_{context['execution_date'].isoformat()}__try_0")
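
With some made-up dag and task values, that definition yields ids like the following (the regex replaces the colons in the ISO timestamp with dashes):

    import re
    from datetime import datetime

    # Illustrative values only.
    dag_id, task_id = "my_dag", "run_bq_query"
    execution_date = datetime(2020, 5, 27, 12, 0)

    job_id = re.sub(
        r"[^0-9a-zA-Z_\-]+", "-",
        f"{dag_id}_{task_id}_{execution_date.isoformat()}__try_0",
    )
    print(job_id)  # my_dag_run_bq_query_2020-05-27T12-00-00__try_0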

and here is roughly how run_query would work:

from typing import Optional

from google.api_core.exceptions import NotFound
from google.cloud.bigquery import QueryJob, QueryJobConfig, TableReference


def run_query(self, job_id: str, job_config: QueryJobConfig, sql: str, destination_dataset_table: Optional[str] = None):
    def _recurse(job_id: str):
        # BQ job ids are unique and can't be re-used, so bump the "__try_" suffix.
        base, try_num = job_id.split("__try_")
        new_job_id = f"{base}__try_{int(try_num) + 1}"
        return _run_query(new_job_id, job_config, sql)

    def _run_query(job_id: str, job_config: QueryJobConfig, sql: str):
        if not self.project_id:
            raise ValueError("The project_id should be set")

        if destination_dataset_table is not None:
            job_config.destination = TableReference.from_string(destination_dataset_table, self.project_id)

        try:
            # Look for a job with this id first, e.g. one started before a pod died.
            job: QueryJob = self.client.get_job(job_id, project=self.project_id)
            if job.state == "RUNNING":
                if job.query != sql:
                    job.cancel()
                    self.log.info(f"Job {job_id} found, but sql is different. "
                                  f"Cancelling the current job and starting a new one")
                    return _recurse(job_id)
                self.log.info(f"Job {job_id} still running, re-starting to poll.")
                return job.result()
            else:
                self.log.info(f"Job {job_id} already executed once. Restarting.")
                return _recurse(job_id)
        except NotFound:
            self.log.info(f"Job {job_id} not found, starting a new job.")

        job: QueryJob = self.client.query(sql, job_config=job_config, job_id=job_id, project=self.project_id)
        self.log.info(f"Running job {job_id}...")
        return job.result()

    return _run_query(job_id, job_config, sql)

The encoded __try_<try_num> is not the Airflow try number but a secondary try counter for when the task is cleared, since BQ Job IDs are unique keys and can't be re-used.
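
A minimal, self-contained sketch of that secondary-try bump (the id is illustrative):

    job_id = "my_dag_run_bq_query_2020-05-27T12-00-00__try_0"
    base, try_num = job_id.split("__try_")
    next_job_id = f"{base}__try_{int(try_num) + 1}"
    print(next_job_id)  # my_dag_run_bq_query_2020-05-27T12-00-00__try_1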

Use case / motivation

Trying to use the functionality in the google cloud library rather than re-implementing it ourselves. This would also allow us to pass through a deterministic Job ID, useful for picking up jobs which are still running in case a pod dies.

Related Issues

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments:14 (13 by maintainers)

Top GitHub Comments

3 reactions
turbaszek commented, May 27, 2020

Summoning @edejong to hear his opinion 😃

1 reaction
albertocalderari commented, Jun 24, 2020

@turbaszek yeah sort of, it's not as simple. I'd really rather have a quick call, are you on Airflow's Slack?


