Strange, inconsistent output lineage on Databricks
I'm using version 0.13.0 on Databricks runtime 9.1 and running into a strange ADLS Delta table issue in my prod environment that I'm unable to reproduce in any other environment.
I'm running a sort of orchestration notebook that starts multiple child notebooks via the Jobs REST API on the same cluster with OpenLineage enabled. This works fine for most Databricks workspaces, but for one I'm seeing inconsistent and mostly missing output lineage.
I created a simple test orchestration notebook that starts only one child, once in the prod workspace and once in a test workspace, both with OpenLineage 0.13.0 and Databricks runtime 9.1. In test I always get correct output lineage; in prod it's very inconsistent. In both, inputs look fine. The notebook should produce output lineage for 2 Delta tables.
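For context, the parent notebook kicks off each child with a one-time run submission against the Jobs REST API. This is a hedged sketch of that call, not the actual orchestration code; the host, token, cluster id, and notebook path are placeholders:

```python
import json
import urllib.request

def build_submit_request(host, token, cluster_id, notebook_path):
    """Build a one-time run submission (Jobs API 2.0 runs/submit).

    All argument values are placeholders for illustration only.
    """
    payload = {
        "run_name": "child-notebook",
        "existing_cluster_id": cluster_id,  # reuse the parent's cluster
        "notebook_task": {"notebook_path": notebook_path},
    }
    return urllib.request.Request(
        f"{host}/api/2.0/jobs/runs/submit",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

def submit_child_notebook(host, token, cluster_id, notebook_path):
    req = build_submit_request(host, token, cluster_id, notebook_path)
    with urllib.request.urlopen(req) as resp:
        # The response carries a run_id that can be polled via runs/get.
        return json.load(resp)["run_id"]
```

Because the children share the parent's cluster, they also share the single `OpenLineageSparkListener` registered on that cluster, which is why a mix-up between parent and child context is even possible.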
I added multiple log points into the OpenLineage spark library and here is what I noticed:
- In the test workspace, the output lineage plan for both Delta tables is `ReplaceTableAsSelect`.
- In prod I either get zero output lineage and `AbstractDatabricksHandler` is never called (I added additional logs to `hasClasses` and `isClass`), or I get output lineage for only one Delta table as `ReplaceTableAsSelect`, but never both.
- I added multiple logs to `sparkSQLExecStart` and `sparkSQLExecEnd` in `OpenLineageSparkListener`, and it looks like many `SparkListenerSQLExecutionStart` events are not adding a context to `sparkSqlExecutionRegistry`, because in `ContextFactory.createSparkSQLExecutionContext` the method `SQLExecution.getQueryExecution(executionId)` does not find anything. I see many `SparkListenerSQLExecutionEnd` events that cannot find a start context in `sparkSqlExecutionRegistry`, and many of them have a `LogicalPlan` of type `ReplaceTableAsSelect`, which I assume is what I want to capture for the expected lineage.
- In prod, most of the time when we do get output lineage, the Databricks `EnvironmentFacet` does not contain `spark.databricks.job.runId` or `spark.databricks.job.id` with values corresponding to the child notebook run like it does in the test workspace. Instead it contains `spark.databricks.clusterUsageTags.clusterName`, which has the run and job id of the parent notebook.
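To make the failure mode in the third bullet concrete, here is a toy model (my own simplification, not the actual OpenLineage code) of the start/end registry behavior described above: if the start event cannot resolve a `QueryExecution`, no context is stored, and the matching end event then finds nothing and emits no output lineage.

```python
# Toy model of the execution-context registry described above.
# Keys are execution ids; values stand in for SparkSQLExecutionContext.
registry = {}

def on_execution_start(execution_id, query_execution):
    # Mirrors the observed behavior: when the lookup analogous to
    # SQLExecution.getQueryExecution(executionId) returns nothing,
    # no context is registered for this execution id.
    if query_execution is None:
        return  # start event silently produces no context
    registry[execution_id] = {"plan": query_execution}

def on_execution_end(execution_id):
    # The end event can only emit lineage if a start context exists.
    context = registry.pop(execution_id, None)
    if context is None:
        return None  # no start context -> no output lineage
    return context["plan"]

# Healthy case (test workspace): start resolves a plan, end emits it.
on_execution_start(1, "ReplaceTableAsSelect")
assert on_execution_end(1) == "ReplaceTableAsSelect"

# Failure case (prod): start cannot resolve the QueryExecution,
# so the later end event has nothing to emit.
on_execution_start(2, None)
assert on_execution_end(2) is None
```

This matches the logs: lots of `SparkListenerSQLExecutionEnd` events carrying a useful `LogicalPlan` but finding no registered start context to attach it to.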
The notebooks are in Python, and the command to write to the Delta table is:

```python
(spark_dataframe
    .write
    .format('delta')
    .mode("Overwrite")
    .saveAsTable(write_location))
```
Could use some guidance on this.
Issue Analytics
- Created a year ago
- Comments: 8 (4 by maintainers)
@pawel-big-lebowski Awesome, glad there is an eventual fix for this from Spark. Just wish I had seen your comment before I spent all this time looking for my issue haha.
Thanks for the help! I’ll close this issue since the problem and fix are already known.
@lazowmich @mobuchowski Great investigation! I think I had some similar findings written here: https://github.com/OpenLineage/OpenLineage/issues/999#issuecomment-1209048556
The good thing is that Spark's `master` branch already contains a fix for that, and `SparkListenerSQLExecutionStart` will contain `QueryExecution`. The bad thing is that the change is not present in the Spark `3.3.1-rc1` branch.