Lineage, BigQuery - lineage from tables/views created via Terraform (REST API) is not being ingested
Describe the bug
Metadata ingestion doesn’t extract lineage from tables or views created via Terraform (which uses the REST API), because DataHub can’t find the relevant logs in the `cloudaudit_googleapis_com_data_access` table.
To Reproduce
Prerequisite: Cloud Logging needs to sink audit logs to BigQuery.
Steps to reproduce the behavior:
- Create Terraform resources for a BQ table and a view:

```hcl
resource "google_bigquery_table" "test_tbl" {
  project    = "${YOUR_PROJECT_ID}"
  dataset_id = "${YOUR_DATASET_ID}"
  table_id   = "test_tbl"

  schema = jsonencode([
    {
      "mode" : "NULLABLE",
      "name" : "col_1",
      "type" : "STRING"
    },
    {
      "mode" : "NULLABLE",
      "name" : "col_2",
      "type" : "INTEGER"
    }
  ])

  deletion_protection = false
}

resource "google_bigquery_table" "test_view" {
  project    = "${YOUR_PROJECT_ID}"
  dataset_id = "${YOUR_DATASET_ID}"
  table_id   = "test_view"

  view {
    query          = "YOUR SQL QUERY"
    use_legacy_sql = false
  }

  deletion_protection = false
}
```
- Apply the Terraform plan to your GCP project.
- Run the following query against the exported audit logs table `cloudaudit_googleapis_com_data_access`. You won’t find any traces of the created objects:

```sql
SELECT
  timestamp,
  logName,
  insertId,
  protopayload_auditlog AS protoPayload,
  protopayload_auditlog.metadataJson AS metadata
FROM
  `YOUR_PROJECT.YOUR_DATASET.cloudaudit_googleapis_com_data_access`
WHERE
  timestamp >= "2022-01-01"
  AND timestamp < "2022-12-31"
  AND protopayload_auditlog.serviceName = "bigquery.googleapis.com"
  AND JSON_EXTRACT_SCALAR(protopayload_auditlog.metadataJson,
    "$.jobChange.job.jobStatus.jobState") = "DONE"
  AND JSON_EXTRACT(protopayload_auditlog.metadataJson,
    "$.jobChange.job.jobStatus.errorResults") IS NULL
  AND JSON_EXTRACT(protopayload_auditlog.metadataJson,
    "$.jobChange.job.jobConfig.queryConfig") IS NOT NULL
  AND (JSON_EXTRACT(protopayload_auditlog.metadataJson,
    "$.jobChange.job.jobConfig.queryConfig.query") LIKE "%test_tbl%"
    OR JSON_EXTRACT(protopayload_auditlog.metadataJson,
      "$.jobChange.job.jobConfig.queryConfig.query") LIKE "%test_view%")
  AND JSON_EXTRACT(protopayload_auditlog.metadataJson,
    "$.jobChange.job.jobConfig.queryConfig.query") NOT LIKE "%cloudaudit_googleapis_com_data_access%"
  AND JSON_EXTRACT(protopayload_auditlog.metadataJson,
    "$.jobChange.job.jobConfig.queryConfig.query") NOT LIKE "%cloudaudit_googleapis_com_activity%"
```
- Run the following query against the exported audit logs table `cloudaudit_googleapis_com_activity`. You will find entries for the tables/views created from Terraform and (theoretically) from any other tool that uses the REST API:

```sql
SELECT
  timestamp,
  logName,
  insertId,
  protopayload_auditlog AS protoPayload,
  protopayload_auditlog.methodName AS methodName,
  protopayload_auditlog.resourceName AS resourceName,
  protopayload_auditlog.metadataJson AS metadata
FROM
  `YOUR_PROJECT.YOUR_DATASET.cloudaudit_googleapis_com_activity`
WHERE
  timestamp >= "2022-01-01"
  AND timestamp < "2022-12-31"
  AND protopayload_auditlog.serviceName = "bigquery.googleapis.com"
  AND protopayload_auditlog.methodName = "google.cloud.bigquery.v2.TableService.InsertTable"
  AND (protopayload_auditlog.resourceName LIKE "%test_tbl"
    OR protopayload_auditlog.resourceName LIKE "%test_view")
ORDER BY
  timestamp DESC
```
What `protopayload_auditlog.metadataJson` looks like for the BQ table:

```json
{
  "@type": "type.googleapis.com/google.cloud.audit.BigQueryAuditMetadata",
  "tableCreation": {
    "reason": "TABLE_INSERT_REQUEST",
    "table": {
      "createTime": "2022-04-26T07:46:53.862Z",
      "schemaJson": "{\n \"fields\": [{\n \"name\": \"col_1\",\n \"type\": \"STRING\",\n \"mode\": \"NULLABLE\"\n }, {\n \"name\": \"col_2\",\n \"type\": \"INTEGER\",\n \"mode\": \"NULLABLE\"\n }]\n}",
      "tableName": "projects/project/datasets/dataset/tables/test_tbl",
      "updateTime": "2022-04-26T07:46:53.902Z"
    }
  }
}
```
What `protopayload_auditlog.metadataJson` looks like for the BQ view:

```json
{
  "@type": "type.googleapis.com/google.cloud.audit.BigQueryAuditMetadata",
  "tableCreation": {
    "reason": "TABLE_INSERT_REQUEST",
    "table": {
      "createTime": "2022-04-26T07:46:53.893Z",
      "schemaJson": "{\n \"fields\": [{\n \"name\": \"col_1\",\n \"type\": \"STRING\",\n \"mode\": \"NULLABLE\"\n }, {\n \"name\": \"col_2\",\n \"type\": \"INTEGER\",\n \"mode\": \"NULLABLE\"\n }]\n}",
      "tableName": "projects/project/datasets/dataset/tables/test_view",
      "updateTime": "2022-04-26T07:46:54.087Z",
      "view": {
        "query": "SELECT * FROM `project.dataset.tbl_1` UNION ALL SELECT * FROM `project.dataset.tbl_2`"
      }
    }
  }
}
```
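The `tableCreation` events above are straightforward to recognize in code. Below is a minimal sketch of how such an event could be picked apart; `parse_table_creation` is a hypothetical helper for illustration, not DataHub's actual implementation:

```python
import json


def parse_table_creation(metadata_json: str):
    """Extract the created table's name and, for views, the defining SQL
    from a BigQueryAuditMetadata payload. Returns None for other events."""
    metadata = json.loads(metadata_json)
    creation = metadata.get("tableCreation")
    if not creation:
        return None  # e.g. a jobChange event, not a REST-API table creation
    table = creation["table"]
    return {
        "table_name": table["tableName"],
        # "view" is present only when the created object is a view
        "view_query": table.get("view", {}).get("query"),
    }


# Example: the view payload from the activity log above
payload = json.dumps({
    "@type": "type.googleapis.com/google.cloud.audit.BigQueryAuditMetadata",
    "tableCreation": {
        "reason": "TABLE_INSERT_REQUEST",
        "table": {
            "tableName": "projects/project/datasets/dataset/tables/test_view",
            "view": {
                "query": "SELECT * FROM `project.dataset.tbl_1` "
                         "UNION ALL SELECT * FROM `project.dataset.tbl_2`"
            },
        },
    },
})
print(parse_table_creation(payload)["table_name"])
```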
Expected behavior
DataHub needs to cover cases where tables/views were created via the REST API by Terraform or other tools. That is especially important for extracting view lineage, since the view's creation won’t be visible in `cloudaudit_googleapis_com_data_access`: the view was not created by a BQ job but by an API call.
That implies that the query contents from `protopayload_auditlog.metadataJson.tableCreation.table.view.query` need to be parsed by sqlparser and pushed through sqllineage.
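To illustrate what that parsing would yield, the sketch below pulls upstream table references out of the view query from the audit log above using a simple regex. This is a deliberately crude stand-in: DataHub would use sqlparser/sqllineage for real SQL parsing, which handles CTEs, subqueries, and unquoted identifiers that this regex does not:

```python
import re


def upstream_tables(view_query: str) -> list[str]:
    """Rough stand-in for sqllineage: collect backquoted table references
    that follow FROM or JOIN in the view's defining query."""
    pattern = re.compile(r"\b(?:FROM|JOIN)\s+`([^`]+)`", re.IGNORECASE)
    return pattern.findall(view_query)


# The view query captured in metadataJson.tableCreation.table.view.query
query = ("SELECT * FROM `project.dataset.tbl_1` "
         "UNION ALL SELECT * FROM `project.dataset.tbl_2`")
print(upstream_tables(query))  # → ['project.dataset.tbl_1', 'project.dataset.tbl_2']
```

Those two upstream tables are exactly the lineage edges that are currently missing for REST-API-created views.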
This issue seems relevant not only for exported audit log tables (`use_exported_bigquery_audit_metadata: true`) but also for direct extracts from Cloud Logging, as the query templates (both v1 and v2) target the BQ job service only.
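For the direct Cloud Logging path, the missing events could presumably be captured by widening the log filter beyond job-service entries. A hedged sketch of such a filter string, built from the same `serviceName`/`methodName` values used in the activity-table query above (whether DataHub's templates would be extended exactly this way is an assumption):

```python
# Cloud Logging filter matching REST-API table/view creations.
# Field names mirror protopayload_auditlog.* in the exported activity table.
BQ_TABLE_INSERT_FILTER = " AND ".join([
    'protoPayload.serviceName="bigquery.googleapis.com"',
    'protoPayload.methodName="google.cloud.bigquery.v2.TableService.InsertTable"',
])
print(BQ_TABLE_INSERT_FILTER)
```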
Issue Analytics
- State: closed
- Created a year ago
- Comments: 5 (1 by maintainers)
Top GitHub Comments
Hi @vgaidass, thanks so much for the level of detail here - definitely something we need to address.
Hi @anshbansal - please take a look at this one!
This issue was closed because it has been inactive for 30 days since being marked as stale.