Column-Level Lineage
Since @ekimd has started the effort on column-level lineage analysis, I’m creating this ticket to track all the questions we have to answer before we dive into implementation details. Given our pure code-analysis approach, which does not involve metadata, I believe we have several design choices to make.
Question No.1: What data structure should represent column-level lineage?
Currently we’re using DiGraph from the networkx library to represent table-level lineage, with tables as vertices and table-level lineage as edges, which is pretty straightforward. After changing to column-level, what’s the plan?
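One possible direction, sketched here only as a starting point for discussion: keep DiGraph, but make each vertex a column identified by a (table, column) tuple, and derive the table-level graph by collapsing columns. The tuple convention and the helper names add_column_lineage and table_level_view are assumptions for illustration, not existing SQLLineage code.

```python
import networkx as nx


def add_column_lineage(graph, source, target):
    """Record that the `target` column is derived from the `source` column.

    Both arguments are hypothetical (table, column) tuples.
    """
    graph.add_edge(source, target)


def table_level_view(column_graph):
    """Collapse a column-level graph into a table-level DiGraph."""
    table_graph = nx.DiGraph()
    for (src_table, _), (tgt_table, _) in column_graph.edges:
        # Skip edges between columns of the same table.
        if src_table != tgt_table:
            table_graph.add_edge(src_table, tgt_table)
    return table_graph


column_graph = nx.DiGraph()
add_column_lineage(column_graph, ("tab2", "col1"), ("tab1", "col1"))
add_column_lineage(column_graph, ("tab3", "col2"), ("tab1", "col2"))

table_graph = table_level_view(column_graph)
print(sorted(table_graph.edges))
```

With this shape, the table-level result stays derivable from the column-level graph, so the two views would not need to be maintained separately.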
Question No.2: How do we deal with SELECT *?
INSERT OVERWRITE tab1
SELECT * FROM tab2;
In this case, we don’t know which columns are in tab2.
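One option, sketched below under assumptions, is to keep the wildcard as a placeholder column and expand it only when column names happen to be known (for example, from an earlier statement in the same script). The Column class, the WILDCARD constant, and expand_wildcard are hypothetical names used for illustration only.

```python
from dataclasses import dataclass

WILDCARD = "*"  # placeholder name for "all columns of this table"


@dataclass(frozen=True)
class Column:
    table: str
    name: str

    @property
    def is_wildcard(self):
        return self.name == WILDCARD


def expand_wildcard(edge, known_columns):
    """Expand a wildcard edge into per-column edges when possible.

    `known_columns` maps table name -> list of column names. If the source
    table is not in it, the wildcard edge is returned unchanged.
    """
    source, target = edge
    if not source.is_wildcard or source.table not in known_columns:
        return [edge]
    return [
        (Column(source.table, name), Column(target.table, name))
        for name in known_columns[source.table]
    ]


edge = (Column("tab2", WILDCARD), Column("tab1", WILDCARD))

# No column information: the lineage stays tab2.* -> tab1.*
unresolved = expand_wildcard(edge, {})

# Column information available: one edge per column
resolved = expand_wildcard(edge, {"tab2": ["col1", "col2"]})
```

The unexpanded edge still records that every column of tab1 depends on tab2, which is the most that code analysis alone can state.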
Question No.3: How do we deal with a column that has no table/alias prefix in a join?
INSERT OVERWRITE tab1
SELECT col2
FROM tab2
JOIN tab3
ON tab2.col1 = tab3.col1
In this case, we don’t know whether col2 is coming from tab2 or tab3.
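One way to model this without metadata, offered as a sketch rather than a decision, is to record every table in scope as a candidate source and let a later step narrow the set. The function resolve_column and its parameters are hypothetical.

```python
def resolve_column(column_name, tables_in_scope, qualifier=None, alias_map=None):
    """Return (table, column) candidates for a column reference.

    With a qualifier (table name or alias) the result has one entry.
    Without one, every table in the FROM/JOIN scope is a candidate.
    """
    alias_map = alias_map or {}
    if qualifier is not None:
        # Map an alias back to its table; fall back to the qualifier itself.
        return [(alias_map.get(qualifier, qualifier), column_name)]
    return [(table, column_name) for table in tables_in_scope]


# SELECT col2 FROM tab2 JOIN tab3 ...: col2 may come from either table.
candidates = resolve_column("col2", ["tab2", "tab3"])

# One candidate edge per possible source of tab1.col2.
candidate_edges = [(source, ("tab1", "col2")) for source in candidates]

# A qualified reference such as t3.col2 resolves to a single table.
qualified = resolve_column("col2", ["tab2", "tab3"], qualifier="t3",
                           alias_map={"t3": "tab3"})
```

When only one table is in scope, the candidate list has a single entry, so the ambiguity only appears in joins and similar multi-source queries.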
Question No.4: How do we visualize column-level lineage?
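As one possibility, assuming a Cytoscape.js-style front end (cytoscape is mentioned in the comments below), columns could be drawn as child nodes nested inside their table node, with edges between columns. The sketch below builds such an elements structure from (table, column) edge pairs; to_cytoscape_elements is a hypothetical helper, not part of SQLLineage.

```python
import json


def to_cytoscape_elements(column_edges):
    """Build Cytoscape.js-style elements from (table, column) edge pairs.

    Each table becomes a compound (parent) node and each column a child
    node whose data carries a `parent` key pointing at its table.
    """
    nodes = {}
    edges = []
    for source, target in column_edges:
        for table, column in (source, target):
            column_id = f"{table}.{column}"
            nodes[table] = {"data": {"id": table}}
            nodes[column_id] = {"data": {"id": column_id, "parent": table}}
        source_id = f"{source[0]}.{source[1]}"
        target_id = f"{target[0]}.{target[1]}"
        edges.append({
            "data": {
                "id": f"{source_id}->{target_id}",
                "source": source_id,
                "target": target_id,
            }
        })
    return {"nodes": list(nodes.values()), "edges": edges}


elements = to_cytoscape_elements([
    (("tab2", "col1"), ("tab1", "col1")),
    (("tab2", "col2"), ("tab1", "col2")),
])
print(json.dumps(elements, indent=2))
```

The resulting JSON could be passed as the elements of a Cytoscape.js graph; layout and styling are left out of this sketch.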
Top GitHub Comments
Hi @wowpoppy, @ekimd’s approach is actually very enlightening to me. So if you feel the same, please do refer to his work.
For SQLLineage to become more than a CLI-first tool, I’m currently doing some front-end visualization work (with react/redux/cytoscape/monaco editor, to name a few of the JS packages used). The intention behind this is: 1) to get rid of graphviz so that it will be friendlier to Windows users; 2) to lay the foundation for future column-level visualization.
It’s taking me more time than I first estimated. Hopefully I’ll get the front-end coding done this month and come back to column-level lineage as well as the programmatic API soon after.
You’re right. Currently I’m intentionally hiding some of the Python API under the hood. By contrast, I’m assuming the CLI is mature enough, so that is the part getting more documentation.
Closing this issue as we’re ready for v1.3.0 release.
I’ve split the remaining cases into two separate issues, targeting a fix in v1.3.1:
Thanks everyone for helping with this feature.