Design Review: column level lineage feature
Is your feature request related to a problem? Please describe. Yes. Column-level lineage support has been requested a few times in the past.
Describe the solution you’d like This issue is meant to serve as a design document describing how this feature should be built.
Describe alternatives you’ve considered n/a
Additional context While DataHub currently supports table-level lineage as an aspect of a dataset, there is a strong need for column-level lineage. A sample illustration of column-level lineage:
If we look at the right part of this screenshot, we notice that:
- table INSERT-SELECT-1 came from tables orders and customers
- the oid, cid, ottl, and sid columns of INSERT-SELECT-1 came from columns of the orders table
- the cl and cem columns of INSERT-SELECT-1 came from columns of the customers table
- there are more tables on the right: small_orders, medium_orders, large_orders, and special_orders are derived from INSERT-SELECT-1
Below this INSERT-SELECT-1, there is another lineage representation that follows a similar fashion.
Now look at the left part of this screenshot. It shows the SQL statement used to generate the target table, and how the columns in the target table are derived from the source tables.
In this design review, I think we need to address two important issues:
- How should we modify the Dataset’s Upstream.pdl to support column-level lineage? To make it easier to understand, the current Upstream.pdl looks like this (code comments removed for brevity):
import com.linkedin.common.DatasetUrn

record Upstream {
  auditStamp: AuditStamp
  dataset: DatasetUrn
  type: DatasetLineageType
}
- How could we provide a sample script (Python-like) that end users could use to parse their SQL statements easily and emit MCE messages for DataHub to pick up?
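To make the two questions above more concrete, here is a minimal Python sketch of what a column-level lineage record could look like for the INSERT-SELECT-1 example. It is only an illustration under stated assumptions: ColumnLineage and group_by_upstream are hypothetical names, not part of DataHub's models or ingestion API, and the sketch does not parse SQL or emit MCEs. The upstream column name is left optional because the description above does not say which source columns were used.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ColumnLineage:
    """One column-level edge (hypothetical shape, not a DataHub model)."""

    downstream_column: str
    upstream_dataset: str
    # The issue does not name the source columns, so this stays optional.
    upstream_column: Optional[str] = None


def group_by_upstream(edges: List[ColumnLineage]) -> Dict[str, List[str]]:
    """Group downstream columns by the upstream dataset they came from."""
    grouped: Dict[str, List[str]] = {}
    for edge in edges:
        grouped.setdefault(edge.upstream_dataset, []).append(edge.downstream_column)
    return grouped


# Edges for INSERT-SELECT-1, as described for the screenshot above.
EDGES = (
    [ColumnLineage(col, "orders") for col in ("oid", "cid", "ottl", "sid")]
    + [ColumnLineage(col, "customers") for col in ("cl", "cem")]
)

if __name__ == "__main__":
    for dataset, columns in group_by_upstream(EDGES).items():
        print(f"{dataset} -> INSERT-SELECT-1: {', '.join(columns)}")
```

The keys of the grouped result are exactly the upstream tables, so a column-level structure like this also carries the existing table-level lineage. One possible direction, again only an assumption, is to attach a list of such edges to each Upstream entry rather than replacing the existing fields.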
To be continued
Issue Analytics
- Created 3 years ago
- Reactions: 13
- Comments: 21 (21 by maintainers)
Top GitHub Comments
Absolutely, @jplaisted. This issue was created before the RFC process was adopted. Happy to convert this into an RFC for future reference, and also happy to split it into multiple PRs. I don’t expect this PR to get merged as it stands, since I did some hacks.
So I think this issue has a lot of really great ideas in it, but it is starting to get a little large and hard to follow. Jumping right from here to a large PR isn’t that easy either 😃
Can we maybe try the full RFC process here? i.e. a design doc? That should be easier to follow than this issue (the latest state of the RFC PR is the current proposal, no need to read a large back and forth discussion if you want to jump right in), and we can review that, and then after that is ok’d we can start code reviews.
I would also strongly suggest multiple PRs; try to make them smaller. A good example is the first PR should probably be models only, no code changes. Then you can start adding code support.
Let me know what you think, thanks!