
Lineage, BigQuery - lineage from tables/views created via Terraform (REST API) is not being ingested

See original GitHub issue

Describe the bug
Metadata ingestion doesn’t extract lineage from tables or views created via Terraform (which uses the BigQuery REST API), because DataHub can’t find the relevant entries in the cloudaudit_googleapis_com_data_access table.

Screenshots
Visualization of the issue:

[screenshot: see the original issue]

To Reproduce
Prerequisite: Cloud Logging must be configured to sink audit logs to BigQuery.
Steps to reproduce the behavior:

  1. Create Terraform resources for BQ table and view:
resource "google_bigquery_table" "test_tbl" {
  project    = "${YOUR_PROJECT_ID}"
  dataset_id = "${YOUR_DATASET_ID}"
  table_id   = "test_tbl"
  schema = jsonencode([
    {
      "mode" : "NULLABLE",
      "name" : "col_1",
      "type" : "STRING"
    },
    {
      "mode" : "NULLABLE",
      "name" : "col_2",
      "type" : "INTEGER"
    }
  ])

  deletion_protection = false
}

resource "google_bigquery_table" "test_view" {
  project    = "${YOUR_PROJECT_ID}"
  dataset_id = "${YOUR_DATASET_ID}"
  table_id   = "test_view"
  view {
    query          = "YOUR SQL QUERY"
    use_legacy_sql = false
  }

  deletion_protection = false
}
  2. Apply the Terraform plan to your GCP project
  3. Try to run the following query against the exported audit logs table cloudaudit_googleapis_com_data_access. You won’t find any trace of the created objects:
SELECT
  timestamp,
  logName,
  insertId,
  protopayload_auditlog AS protoPayload,
  protopayload_auditlog.metadataJson AS metadata
FROM
  `YOUR_PROJECT.YOUR_DATASET.cloudaudit_googleapis_com_data_access`
WHERE
  timestamp >= "2022-01-01"
  AND timestamp < "2022-12-31"
  AND protopayload_auditlog.serviceName="bigquery.googleapis.com"
  AND JSON_EXTRACT_SCALAR(protopayload_auditlog.metadataJson,
    "$.jobChange.job.jobStatus.jobState") = "DONE"
  AND JSON_EXTRACT(protopayload_auditlog.metadataJson,
    "$.jobChange.job.jobStatus.errorResults") IS NULL
  AND JSON_EXTRACT(protopayload_auditlog.metadataJson,
    "$.jobChange.job.jobConfig.queryConfig") IS NOT NULL
  AND (JSON_EXTRACT(protopayload_auditlog.metadataJson,
      "$.jobChange.job.jobConfig.queryConfig.query") LIKE "%test_tbl%"
    OR JSON_EXTRACT(protopayload_auditlog.metadataJson,
      "$.jobChange.job.jobConfig.queryConfig.query") LIKE "%test_view%")
  AND JSON_EXTRACT(protopayload_auditlog.metadataJson,
    "$.jobChange.job.jobConfig.queryConfig.query") NOT LIKE "%cloudaudit_googleapis_com_data_access%"
  AND JSON_EXTRACT(protopayload_auditlog.metadataJson,
    "$.jobChange.job.jobConfig.queryConfig.query") NOT LIKE "%cloudaudit_googleapis_com_activity%"
  4. Try to run the following query against the exported audit logs table cloudaudit_googleapis_com_activity. You will find entries for tables/views created from Terraform and (in principle) from any other tool that uses the REST API:
SELECT
  timestamp,
  logName,
  insertId,
  protopayload_auditlog AS protoPayload,
  protopayload_auditlog.methodName AS methodName,
  protopayload_auditlog.resourceName AS resourceName,
  protopayload_auditlog.metadataJson AS metadata
FROM
  `YOUR_PROJECT.YOUR_DATASET.cloudaudit_googleapis_com_activity`
WHERE
  timestamp >= "2022-01-01"
  AND timestamp < "2022-12-31"
  AND protopayload_auditlog.serviceName="bigquery.googleapis.com"
  AND protopayload_auditlog.methodName = "google.cloud.bigquery.v2.TableService.InsertTable"
  AND (protopayload_auditlog.resourceName LIKE "%test_tbl"
    OR protopayload_auditlog.resourceName LIKE "%test_view")
ORDER BY
  timestamp DESC
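The contrast between the two queries above comes down to where each kind of creation is logged: query jobs are recorded under jobChange in the data_access log, while REST API creations are recorded under tableCreation in the activity log. A minimal sketch of that distinction (the function name and sample record are illustrative, not DataHub code):

```python
import json

def creation_source(metadata_json: str) -> str:
    """Classify a BigQueryAuditMetadata record: query-job-driven
    operations appear under 'jobChange' (data_access log), while
    REST API creations appear under 'tableCreation' (activity log)."""
    md = json.loads(metadata_json)
    if "jobChange" in md:
        return "bigquery-job"  # visible in cloudaudit_googleapis_com_data_access
    if "tableCreation" in md:
        return "rest-api"      # visible only in cloudaudit_googleapis_com_activity
    return "other"

# Record shaped like the Terraform-created view's audit entry:
sample = json.dumps({
    "@type": "type.googleapis.com/google.cloud.audit.BigQueryAuditMetadata",
    "tableCreation": {
        "reason": "TABLE_INSERT_REQUEST",
        "table": {"tableName": "projects/p/datasets/d/tables/test_view"},
    },
})
print(creation_source(sample))  # -> rest-api
```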

What protopayload_auditlog.metadataJson looks like for the BQ table:

{
    "@type": "type.googleapis.com/google.cloud.audit.BigQueryAuditMetadata",
    "tableCreation": {
        "reason": "TABLE_INSERT_REQUEST",
        "table": {
            "createTime": "2022-04-26T07:46:53.862Z",
            "schemaJson": "{\n  \"fields\": [{\n    \"name\": \"col_1\",\n    \"type\": \"STRING\",\n    \"mode\": \"NULLABLE\"\n  }, {\n    \"name\": \"col_2\",\n    \"type\": \"INTEGER\",\n    \"mode\": \"NULLABLE\"\n  }]\n}",
            "tableName": "projects/project/datasets/dataset/tables/test_tbl",
            "updateTime": "2022-04-26T07:46:53.902Z"
        }
    }
}

What protopayload_auditlog.metadataJson looks like for the BQ view:

{
    "@type": "type.googleapis.com/google.cloud.audit.BigQueryAuditMetadata",
    "tableCreation": {
        "reason": "TABLE_INSERT_REQUEST",
        "table": {
            "createTime": "2022-04-26T07:46:53.893Z",
            "schemaJson": "{\n  \"fields\": [{\n    \"name\": \"col_1\",\n    \"type\": \"STRING\",\n    \"mode\": \"NULLABLE\"\n  },  {\n    \"name\": \"col_2\",\n    \"type\": \"INTEGER\",\n    \"mode\": \"NULLABLE\"\n  }]\n}",
            "tableName": "projects/project/datasets/dataset/tables/test_view",
            "updateTime": "2022-04-26T07:46:54.087Z",
            "view": {
                "query": "SELECT * FROM `project.dataset.tbl_1` UNION ALL SELECT * FROM `project.dataset.tbl_2`"
            }
        }
    }
}

Expected behavior
DataHub should cover the case where tables/views were created via the REST API by Terraform or other tools. This is especially important for extracting view lineage: the view’s creation won’t be visible in cloudaudit_googleapis_com_data_access because the view was created by an API call rather than by a BigQuery job.

This implies that the query contents from protopayload_auditlog.metadataJson.tableCreation.table.view.query need to be parsed by sqlparser and pushed through sqllineage.
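As a rough illustration of that parsing step (a dependency-free regex sketch, not DataHub’s actual sqlparser/sqllineage pipeline; the function name is hypothetical):

```python
import json
import re

def upstream_tables_from_table_creation(metadata_json: str):
    """Illustrative only: pull the view definition out of a
    tableCreation audit record and list the backtick-quoted
    `project.dataset.table` references it selects from. A real
    implementation would hand the query to a SQL lineage parser
    instead of a regex."""
    md = json.loads(metadata_json)
    view = md.get("tableCreation", {}).get("table", {}).get("view")
    if not view:
        return []  # plain table, or not a tableCreation record
    # Match fully-qualified, backtick-quoted table references.
    return re.findall(r"`([\w-]+\.\w+\.\w+)`", view["query"])

sample = json.dumps({
    "tableCreation": {"table": {
        "tableName": "projects/project/datasets/dataset/tables/test_view",
        "view": {"query": "SELECT * FROM `project.dataset.tbl_1` "
                          "UNION ALL SELECT * FROM `project.dataset.tbl_2`"},
    }}
})
print(upstream_tables_from_table_creation(sample))
# -> ['project.dataset.tbl_1', 'project.dataset.tbl_2']
```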

This issue seems to be relevant not only for exported audit log tables (use_exported_bigquery_audit_metadata: true) but also for direct extracts from Cloud Logging, as the query templates (both v1 and v2) target the BigQuery job service only.
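For the direct Cloud Logging path, extending coverage would roughly mean widening the log filter beyond job completions. A sketch of such a combined filter (illustrative only, built with standard Cloud Logging filter syntax; not the actual DataHub template):

```python
# The v1/v2 templates only match BigQuery *job* entries; a filter
# like this would additionally capture REST API table/view creations.
JOB_FILTER = 'protoPayload.metadata.jobChange.job.jobStatus.jobState="DONE"'
API_CREATE_FILTER = (
    'protoPayload.methodName='
    '"google.cloud.bigquery.v2.TableService.InsertTable"'
)
COMBINED_FILTER = f"({JOB_FILTER}) OR ({API_CREATE_FILTER})"
print(COMBINED_FILTER)
```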

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 5 (1 by maintainers)

Top GitHub Comments

1 reaction
maggiehays commented, Apr 26, 2022

Hi @vgaidass, thanks so much for the level of detail here - definitely something we need to address.

Hi @anshbansal - please take a look at this one!

0 reactions
github-actions[bot] commented, Nov 16, 2022

This issue was closed because it has been inactive for 30 days since being marked as stale.
