Strange inconsistent output lineage Databricks

I’m using OpenLineage 0.13.0 on Databricks Runtime 9.1 and running into a strange ADLS Delta table issue in my prod environment that I’m unable to reproduce in any other environment.

I’m running a sort of orchestration notebook that starts multiple child notebooks via the Jobs REST API on the same cluster, with OpenLineage enabled. For most Databricks workspaces this works fine, but for one I’m seeing inconsistent and mostly missing output lineage.
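
Roughly, the orchestration looks like the sketch below. This is only an illustration of the pattern, not the real orchestrator: the host, token, cluster id and notebook path are placeholders, and it calls the Databricks Jobs runs/submit endpoint against the existing interactive cluster so the child runs where OpenLineage is configured.

# Illustrative sketch only: launch a child notebook on the same cluster via the
# Databricks Jobs runs/submit REST endpoint. Every identifier below is a placeholder.
import requests

DATABRICKS_HOST = "https://<workspace>.azuredatabricks.net"  # placeholder
DATABRICKS_TOKEN = "<personal-access-token>"                 # placeholder
EXISTING_CLUSTER_ID = "<cluster-id>"                         # placeholder

payload = {
    "run_name": "child-notebook-run",
    "existing_cluster_id": EXISTING_CLUSTER_ID,
    "notebook_task": {"notebook_path": "/path/to/child_notebook"},  # placeholder
}

response = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/runs/submit",
    headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
    json=payload,
)
response.raise_for_status()
child_run_id = response.json()["run_id"]  # the run id the lineage facet is expected to reference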

I created a simple test orchestration notebook that starts only one child, with one copy in the prod workspace and another in a test workspace, both on OpenLineage 0.13.0 and Databricks Runtime 9.1. In test I always get correct output lineage; in prod it’s very inconsistent. In both, the input lineage looks fine. The child notebook should produce output lineage for two Delta tables.

I added multiple log points into the OpenLineage Spark library, and here is what I noticed:

  • In the test workspace, the output lineage plan for both Delta tables is ReplaceTableAsSelect.
  • In prod, I either get zero output lineage and AbstractDatabricksHandler is never called (I added additional logs to hasClasses and isClass), or I get output lineage for only one Delta table as ReplaceTableAsSelect, never both.
  • I added multiple logs to sparkSQLExecStart and sparkSQLExecEnd in OpenLineageSparkListener. Many SparkListenerSQLExecutionStart events never add a context to sparkSqlExecutionRegistry, because in ContextFactory.createSparkSQLExecutionContext the call SQLExecution.getQueryExecution(executionId) finds nothing. I then see many SparkListenerSQLExecutionEnd events that cannot find a start context in sparkSqlExecutionRegistry, and many of them carry a LogicalPlan of type ReplaceTableAsSelect, which I assume is exactly what needs to be captured for the expected lineage.
  • In prod, most of the time when we do get output lineage, the Databricks EnvironmentFacet does not contain spark.databricks.job.runId or spark.databricks.job.id with values corresponding to the child notebook run, as it does in the test workspace. Instead it contains spark.databricks.clusterUsageTags.clusterName, which carries the run and job id of the parent notebook (see the conf-inspection sketch below this list).
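
To make that last point easier to check, here is a minimal sketch that prints those identifiers from inside the child notebook. It assumes the spark session that Databricks notebooks predefine; the keys are exactly the ones named above, and anything not set falls back to a placeholder string.

# Minimal sketch: print the Databricks identifiers mentioned above from the
# child notebook's Spark conf. Assumes the `spark` session Databricks predefines
# in every notebook; unset keys fall back to the placeholder instead of raising.
for key in (
    "spark.databricks.job.id",
    "spark.databricks.job.runId",
    "spark.databricks.clusterUsageTags.clusterName",
):
    print(key, "=", spark.conf.get(key, "<not set>"))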

The notebooks are in Python, and the command that writes to each Delta table is:

# write_location is the target table name
(
    spark_dataframe
    .write
    .format("delta")
    .mode("overwrite")
    .saveAsTable(write_location)
)
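
For what it’s worth, the same write expressed through the DataFrameWriterV2 API (available in the Spark 3.1 that Databricks 9.1 runs) should surface as the same ReplaceTableAsSelect plan node. This is only a sketch of the equivalent call, reusing spark_dataframe and write_location from above.

# Sketch of the DataFrameWriterV2 equivalent of the overwrite saveAsTable above;
# createOrReplace() is expected to show up as ReplaceTableAsSelect in the logical plan.
(
    spark_dataframe
    .writeTo(write_location)
    .using("delta")
    .createOrReplace()
)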

Could use some guidance on this.

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

2 reactions
lazowmich commented, Sep 26, 2022

@pawel-big-lebowski Awesome, glad there is an eventual fix for this from Spark. Just wish I had seen your comment before I spent all this time looking for my issue haha.

Thanks for the help! I’ll close this issue since the problem and fix are already known.

0 reactions
pawel-big-lebowski commented, Sep 22, 2022

@lazowmich @mobuchowski Great investigation! I think I had some similar findings written here: https://github.com/OpenLineage/OpenLineage/issues/999#issuecomment-1209048556

The good thing is that Spark’s master branch already contains a fix for that, and SparkListenerSQLExecutionStart will contain the QueryExecution. The bad thing is that the change is not present in the Spark 3.3.1-rc1 branch.
