Strange, inconsistent output lineage on Databricks
I'm using version 0.13.0 on Databricks runtime 9.1 and running into a strange ADLS Delta table issue in my prod environment that I'm unable to reproduce in any other environment.
I'm running a sort of orchestration notebook that starts multiple child notebooks via the Jobs REST API on the same cluster with OpenLineage enabled. This works fine for most Databricks workspaces, but for one I'm seeing inconsistent and mostly missing output lineage.
I created a simple test orchestration notebook that starts only one child, once in the prod workspace and once in a test workspace, both with OpenLineage 0.13.0 and Databricks runtime 9.1. In test I always get correct output lineage; in prod it's very inconsistent. In both, inputs look fine. The notebook should produce output lineage for 2 Delta tables.
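For context, the parent notebook kicks off each child with a one-time run submission against the Jobs REST API. This is a hedged sketch of that call, not the actual orchestration code; the host, token, cluster id, and notebook path are placeholders:

```python
import json
import urllib.request

def build_submit_request(host, token, cluster_id, notebook_path):
    """Build a one-time run submission (Jobs API 2.0 runs/submit).

    All argument values are placeholders for illustration only.
    """
    payload = {
        "run_name": "child-notebook",
        "existing_cluster_id": cluster_id,  # reuse the parent's cluster
        "notebook_task": {"notebook_path": notebook_path},
    }
    return urllib.request.Request(
        f"{host}/api/2.0/jobs/runs/submit",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

def submit_child_notebook(host, token, cluster_id, notebook_path):
    req = build_submit_request(host, token, cluster_id, notebook_path)
    with urllib.request.urlopen(req) as resp:
        # The response carries a run_id that can be polled via runs/get.
        return json.load(resp)["run_id"]
```

Because the children share the parent's cluster, they also share the single `OpenLineageSparkListener` registered on that cluster, which is why a mix-up between parent and child context is even possible.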
I added multiple log points into the OpenLineage spark library and here is what I noticed:
- In the test workspace, the output lineage plan for both Delta tables is `ReplaceTableAsSelect`.
- In prod I either get zero output lineage and `AbstractDatabricksHandler` is never called (I added additional logs to `hasClasses` and `isClass`), or I get output lineage for only one Delta table as `ReplaceTableAsSelect`, but never both.
- I added multiple logs to `sparkSQLExecStart` and `sparkSQLExecEnd` in `OpenLineageSparkListener`, and it looks like many `SparkListenerSQLExecutionStart` events are not adding a context to `sparkSqlExecutionRegistry`, because in `ContextFactory.createSparkSQLExecutionContext` the method `SQLExecution.getQueryExecution(executionId)` does not find anything. I see many `SparkListenerSQLExecutionEnd` events that cannot find a start context in `sparkSqlExecutionRegistry`, and many of them have a `LogicalPlan` of type `ReplaceTableAsSelect`, which I assume is what I want to capture for the expected lineage.
- In prod, most of the time when we do get output lineage, the Databricks `EnvironmentFacet` does not contain `spark.databricks.job.runId` or `spark.databricks.job.id` with values corresponding to the child notebook run like it does in the test workspace. Instead it contains `spark.databricks.clusterUsageTags.clusterName`, which has the run and job id of the parent notebook.
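To make the failure mode in the third bullet concrete, here is a toy model (my own simplification, not the actual OpenLineage code) of the start/end registry behavior described above: if the start event cannot resolve a `QueryExecution`, no context is stored, and the matching end event then finds nothing and emits no output lineage.

```python
# Toy model of the execution-context registry described above.
# Keys are execution ids; values stand in for SparkSQLExecutionContext.
registry = {}

def on_execution_start(execution_id, query_execution):
    # Mirrors the observed behavior: when the lookup analogous to
    # SQLExecution.getQueryExecution(executionId) returns nothing,
    # no context is registered for this execution id.
    if query_execution is None:
        return  # start event silently produces no context
    registry[execution_id] = {"plan": query_execution}

def on_execution_end(execution_id):
    # The end event can only emit lineage if a start context exists.
    context = registry.pop(execution_id, None)
    if context is None:
        return None  # no start context -> no output lineage
    return context["plan"]

# Healthy case (test workspace): start resolves a plan, end emits it.
on_execution_start(1, "ReplaceTableAsSelect")
assert on_execution_end(1) == "ReplaceTableAsSelect"

# Failure case (prod): start cannot resolve the QueryExecution,
# so the later end event has nothing to emit.
on_execution_start(2, None)
assert on_execution_end(2) is None
```

This matches the logs: lots of `SparkListenerSQLExecutionEnd` events carrying a useful `LogicalPlan` but finding no registered start context to attach it to.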
The notebooks are in Python, and the command to write to the Delta table is:

```python
(spark_dataframe
    .write
    .format('delta')
    .mode("Overwrite")
    .saveAsTable(write_location))
```
Could use some guidance on this.
Issue Analytics
- Created a year ago
- Comments: 8 (4 by maintainers)
@pawel-big-lebowski Awesome, glad there is an eventual fix for this from Spark. Just wish I had seen your comment before I spent all this time looking for my issue haha.
Thanks for the help! I’ll close this issue since the problem and fix are already known.
@lazowmich @mobuchowski Great investigation! I think I had some similar findings written here: https://github.com/OpenLineage/OpenLineage/issues/999#issuecomment-1209048556
The good thing is that Spark's `master` branch already contains a fix for that, and `SparkListenerSQLExecutionStart` will contain `QueryExecution`. The bad thing is that the change is not present in the Spark `3.3.1-rc1` branch.