Design Review: column level lineage feature
Is your feature request related to a problem? Please describe. Yes. Column-level lineage support has been requested a few times in the past.
Describe the solution you’d like This issue is meant to serve as documentation of how this feature should be designed.
Describe alternatives you’ve considered n/a
Additional context
DataHub currently supports table-level lineage as an aspect of a dataset, but there is a strong need for column-level lineage as well.
A sample illustration of column-level lineage:

If we look at the right part of this screenshot, we notice that:
- table `INSERT-SELECT-1` came from the tables `orders` and `customers`
- the `oid`, `cid`, `ottl`, `sid` columns of `INSERT-SELECT-1` come from columns of the `orders` table
- the `cl` and `cem` columns of `INSERT-SELECT-1` come from columns of the `customers` table
- there are more tables on the right: `small_orders`, `medium_orders`, `large_orders` and `special_orders` are derived from `INSERT-SELECT-1`
Below `INSERT-SELECT-1`, there is another lineage representation that follows a similar pattern.
Now look at the left part of this screenshot. It shows how the SQL statement is used to generate the target table, and how the columns in the target table are derived from the source tables.
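To make the relationships described above concrete, here is a minimal Python sketch of the `INSERT-SELECT-1` lineage as plain data. It records only what the text states (which source table each target column comes from); the source column names are not given, so they are left out. The names `COLUMN_SOURCES`, `table_level_lineage` and `columns_by_source` are made up for illustration and are not part of DataHub.

```python
from collections import defaultdict

# Target column of INSERT-SELECT-1 -> source table it is derived from,
# exactly as listed in the description above.
COLUMN_SOURCES = {
    "oid": "orders",
    "cid": "orders",
    "ottl": "orders",
    "sid": "orders",
    "cl": "customers",
    "cem": "customers",
}


def table_level_lineage(column_sources):
    """Collapse column-level lineage into the sorted list of upstream tables."""
    return sorted(set(column_sources.values()))


def columns_by_source(column_sources):
    """Group target columns by the source table they come from."""
    grouped = defaultdict(list)
    for column, table in column_sources.items():
        grouped[table].append(column)
    return dict(grouped)


if __name__ == "__main__":
    print(table_level_lineage(COLUMN_SOURCES))
    print(columns_by_source(COLUMN_SOURCES))
```

The point of the sketch is that table-level lineage can always be derived from column-level lineage, but not the other way around, which is why the column-level information has to be stored explicitly.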
In this design review, I think we need to address two important issues:
- How should we modify the Dataset’s `Upstream.pdl` to support column-level lineage? For reference, the current `Upstream.pdl` looks like this (code comments removed for brevity):
import com.linkedin.common.DatasetUrn
record Upstream {
  auditStamp: AuditStamp
  dataset: DatasetUrn
  type: DatasetLineageType
}
- How could we provide a sample script (Python-like) that end users can use to parse their SQL statements easily and emit MCE messages for DataHub to pick up?
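As a rough illustration of both questions, the sketch below models an `Upstream` record with one extra, purely hypothetical `field_mappings` field, and fills it from a very restricted `INSERT INTO ... (cols) SELECT table.col, ... FROM ...` statement. Everything here is an assumption made for the example: the `FieldMapping` type, the field name, the regex-based parsing (a real script would need a proper SQL parser), the `order_summary` table name, and the dictionary returned by `to_payload`, which is not the real MCE schema.

```python
import re
from dataclasses import asdict, dataclass, field
from typing import List, Tuple


@dataclass
class FieldMapping:
    # Hypothetical: one source column feeding one target column.
    source_table: str
    source_column: str
    target_column: str


@dataclass
class Upstream:
    # Mirrors the dataset and type fields of Upstream.pdl (auditStamp omitted).
    dataset: str
    type: str
    # Hypothetical new field carrying column-level lineage.
    field_mappings: List[FieldMapping] = field(default_factory=list)


# Only matches: INSERT INTO <target> (<cols>) SELECT <table.col, ...> FROM ...
_INSERT_SELECT = re.compile(
    r"INSERT\s+INTO\s+(?P<target>\w+)\s*\((?P<tcols>[^)]*)\)\s*"
    r"SELECT\s+(?P<scols>.*?)\s+FROM\s",
    re.IGNORECASE | re.DOTALL,
)


def parse_insert_select(sql: str) -> Tuple[str, List[Upstream]]:
    """Return the target table and one Upstream per source table."""
    match = _INSERT_SELECT.search(sql)
    if match is None:
        raise ValueError("only simple INSERT ... SELECT statements are supported")
    target_cols = [c.strip() for c in match.group("tcols").split(",")]
    source_cols = [c.strip() for c in match.group("scols").split(",")]
    if len(target_cols) != len(source_cols):
        raise ValueError("target and source column counts differ")
    by_table = {}
    for target_col, source_col in zip(target_cols, source_cols):
        table, _, column = source_col.partition(".")
        if not column:
            raise ValueError(f"expected table.column, got {source_col!r}")
        by_table.setdefault(table, []).append(
            FieldMapping(table, column, target_col)
        )
    # "TRANSFORMED" is used as a placeholder lineage type for the sketch.
    upstreams = [Upstream(t, "TRANSFORMED", m) for t, m in by_table.items()]
    return match.group("target"), upstreams


def to_payload(target: str, upstreams: List[Upstream]) -> dict:
    """Build an illustrative, non-authoritative payload as a plain dict."""
    return {"dataset": target, "upstreams": [asdict(u) for u in upstreams]}


if __name__ == "__main__":
    sql = (
        "INSERT INTO order_summary (oid, cl) "
        "SELECT orders.oid, customers.cl "
        "FROM orders JOIN customers ON orders.cid = customers.cid"
    )
    print(to_payload(*parse_insert_select(sql)))
```

Keeping the column mappings inside each `Upstream` entry is only one possible shape; it keeps table-level consumers working unchanged, since they can ignore the extra field.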
To be continued

Absolutely, @jplaisted. This issue was created before the RFC process was adopted. Happy to convert this into an RFC for future reference. Also happy to split it into multiple PRs. Honestly, I don’t expect this PR to get merged, since I did some hacks.
So I think this issue has a lot of really great ideas in it, but it is starting to get a little large and hard to follow. Jumping right from here to a large PR isn’t that easy either 😃
Can we maybe try the full RFC process here? i.e. a design doc? That should be easier to follow than this issue (the latest state of the RFC PR is the current proposal, no need to read a large back and forth discussion if you want to jump right in), and we can review that, and then after that is ok’d we can start code reviews.
I would also strongly suggest multiple PRs; try to make them smaller. A good example is the first PR should probably be models only, no code changes. Then you can start adding code support.
Let me know what you think, thanks!