Column-Level Lineage

Since @ekimd has started the effort on column-level lineage analysis, I'm creating this ticket to track all the questions we have to answer before we dive into implementation details. Given our pure code-analysis approach, which does not involve metadata, I believe we have several design choices to make.

Question No.1: What data structure should represent column-level lineage? Currently we use DiGraph from the networkx library to represent table-level lineage, with tables as vertices and table-level lineage as edges, which is pretty straightforward. After moving to column level, what's the plan?
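To make the question concrete, here is a minimal sketch of one possible answer: keep a single networkx DiGraph and add columns as extra vertices. The node naming ("tab.col") and the `kind` attributes are assumptions made for illustration, not SQLLineage's actual design.

```python
import networkx as nx

# Hypothetical naming: tables are "tab", columns are "tab.col".
g = nx.DiGraph()

# Table-level lineage as described above: tables are vertices, lineage is an edge.
g.add_node("tab2", kind="table")
g.add_node("tab1", kind="table")
g.add_edge("tab2", "tab1", kind="lineage")

# Column-level lineage in the same graph: columns become vertices too.
for table, column in [("tab2", "col1"), ("tab1", "col1")]:
    col = f"{table}.{column}"
    g.add_node(col, kind="column")
    g.add_edge(table, col, kind="has_column")  # ownership edge, not lineage
g.add_edge("tab2.col1", "tab1.col1", kind="lineage")


def lineage_edges(graph, node_kind):
    """Return lineage edges whose two endpoints both have the given kind."""
    return [
        (u, v)
        for u, v, d in graph.edges(data=True)
        if d["kind"] == "lineage"
        and graph.nodes[u]["kind"] == node_kind
        and graph.nodes[v]["kind"] == node_kind
    ]


table_lineage = lineage_edges(g, "table")
column_lineage = lineage_edges(g, "column")
```

Filtering on the `kind` attribute recovers either view from the same graph, so the table-level output would not need a second data structure.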

Question No.2: How do we deal with SELECT *?

INSERT OVERWRITE tab1
SELECT * FROM tab2;

In this case, we don’t know which columns are in tab2.
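One way to handle this without metadata is to keep the wildcard as a placeholder column and expand it only if a column list becomes available. The sketch below assumes this; `Column`, `expand_wildcard` and `known_columns` are hypothetical names, not part of SQLLineage.

```python
from typing import NamedTuple


class Column(NamedTuple):
    table: str
    name: str  # "*" stands for "every column, names unknown"


def expand_wildcard(edge, known_columns=None):
    """Expand a tab.* -> tab.* edge if a column list is supplied; else keep it."""
    src, dst = edge
    if src.name != "*" or not known_columns:
        return [edge]
    return [(Column(src.table, c), Column(dst.table, c)) for c in known_columns]


# INSERT OVERWRITE tab1 SELECT * FROM tab2;
star_edge = (Column("tab2", "*"), Column("tab1", "*"))

unresolved = expand_wildcard(star_edge)  # pure code analysis: stays as tab2.* -> tab1.*
resolved = expand_wildcard(star_edge, known_columns=["col1", "col2"])
```

With no column list the lineage stays at `tab2.*` to `tab1.*`, which is still more specific than a bare table-level edge.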

Question No.3: How do we deal with a column that has no table/alias prefix in the case of a join?

INSERT OVERWRITE tab1
SELECT col2
FROM tab2
JOIN tab3
ON tab2.col1 = tab3.col1

In this case, we don’t know whether col2 is coming from tab2 or tab3.
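One option is to record every table in scope as a candidate source rather than guessing. This is a rough sketch under that assumption; `candidate_sources` is a hypothetical helper, not an existing SQLLineage function.

```python
def candidate_sources(column, qualifier, tables_in_scope):
    """Return every source column that a column reference might refer to.

    A qualified reference (tab2.col1) has exactly one candidate; an
    unqualified one could belong to any table in the FROM/JOIN clause.
    """
    if qualifier is not None:
        return [f"{qualifier}.{column}"]
    return [f"{table}.{column}" for table in tables_in_scope]


tables = ["tab2", "tab3"]
ambiguous = candidate_sources("col2", None, tables)    # unqualified col2
qualified = candidate_sources("col1", "tab2", tables)  # tab2.col1
```

Whether an ambiguous reference should produce one edge per candidate or a single edge flagged as uncertain is itself a design choice left open here.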

Question No.4: How do we visualize column-level lineage?
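One possibility, assuming the lineage lives in a networkx graph and the front end uses Cytoscape.js, is to export the graph with networkx's `cytoscape_data` helper. The node names below are illustrative only.

```python
import networkx as nx
from networkx.readwrite import json_graph

# A tiny column-level graph; the "tab.col" naming is an assumption.
g = nx.DiGraph()
g.add_edge("tab2.col1", "tab1.col1")

# Convert to the JSON-style dict layout that Cytoscape.js consumes.
payload = json_graph.cytoscape_data(g)
nodes = payload["elements"]["nodes"]
edges = payload["elements"]["edges"]
```

The resulting dict can be serialized with `json.dumps` and handed to a browser-side renderer, which would avoid a dependency on graphviz.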

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 24 (18 by maintainers)

Top GitHub Comments

reata commented, Mar 9, 2021 (2 reactions)

Hi @wowpoppy, @ekimd's approach is actually very enlightening to me. So if you feel the same, please do refer to his work.

I want SQLLineage to become more than a CLI-first tool, so right now I'm doing some front-end visualization work (with React/Redux/Cytoscape/Monaco Editor, to name a few of the JS packages used). The intention behind this is: 1) to get rid of graphviz so that the tool is friendlier to Windows users; 2) to lay the foundation for future column-level visualization.

It's taking me more time than I first estimated. Hopefully I'll get the front-end coding done this month and come back to column-level lineage, as well as the programmatic API, soon after.

You're right. Currently I'm intentionally hiding some of the Python API under the hood. By contrast, I'm assuming the CLI is mature enough, so that is the part getting more documentation.

reata commented, Nov 13, 2021 (0 reactions)

Closing this issue as we’re ready for v1.3.0 release.

I've split the remaining cases into two separate issues and am targeting a fix for them in v1.3.1:

Thanks everyone for helping with this feature.
