Design Review: column level lineage feature

Is your feature request related to a problem? Please describe. Yes. Column-level lineage support has been requested several times in the past.

Describe the solution you’d like This issue is meant to document how this feature should be designed.

Describe alternatives you’ve considered n/a

Additional context While DataHub currently supports table-level lineage as an aspect of a dataset, there is a strong need for column-level lineage. A sample illustration of column-level lineage: (screenshot: column-level-lineage)

If we look at the right part of this screenshot, we notice that:

  1. table INSERT-SELECT-1 is derived from the tables orders and customers
  2. the oid, cid, ottl, and sid columns of INSERT-SELECT-1 come from columns of the orders table
  3. the cl and cem columns of INSERT-SELECT-1 come from columns of the customers table
  4. further to the right, the tables small_orders, medium_orders, large_orders, and special_orders are derived from INSERT-SELECT-1
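
To make these observations concrete, here is a minimal Python sketch of how the column-level edges for INSERT-SELECT-1 could be recorded, and how table-level lineage falls out of them. The `ColumnRef` type, the `COLUMN_LINEAGE` mapping, and the `table_level_upstreams` helper are hypothetical names used only for illustration. The upstream column names are also assumptions, since the list above names only the downstream columns.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class ColumnRef(NamedTuple):
    """A column identified by its table and column name."""
    table: str
    column: str


# Downstream column -> upstream columns it is derived from.
# Downstream names come from the list above; the upstream column names
# are placeholders (assumed to match the downstream names).
COLUMN_LINEAGE: Dict[ColumnRef, List[ColumnRef]] = {
    ColumnRef("INSERT-SELECT-1", "oid"): [ColumnRef("orders", "oid")],
    ColumnRef("INSERT-SELECT-1", "cid"): [ColumnRef("orders", "cid")],
    ColumnRef("INSERT-SELECT-1", "ottl"): [ColumnRef("orders", "ottl")],
    ColumnRef("INSERT-SELECT-1", "sid"): [ColumnRef("orders", "sid")],
    ColumnRef("INSERT-SELECT-1", "cl"): [ColumnRef("customers", "cl")],
    ColumnRef("INSERT-SELECT-1", "cem"): [ColumnRef("customers", "cem")],
}


def table_level_upstreams(
    column_lineage: Dict[ColumnRef, List[ColumnRef]]
) -> Dict[str, List[str]]:
    """Collapse column-level edges into table-level lineage."""
    upstreams = defaultdict(set)
    for downstream, sources in column_lineage.items():
        for source in sources:
            upstreams[downstream.table].add(source.table)
    return {table: sorted(names) for table, names in upstreams.items()}
```

Calling `table_level_upstreams(COLUMN_LINEAGE)` recovers observation 1: INSERT-SELECT-1 depends on customers and orders. This suggests that table-level lineage can be treated as a projection of the column-level data rather than a separate graph.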

Below INSERT-SELECT-1, another lineage case is represented in a similar fashion.

Now look at the left part of the screenshot. It shows the SQL statement used to generate the target table, and how the columns of the target table are derived from the source tables.

In this design review, I think we need to address two important issues:

  1. How should we modify the Dataset’s Upstream.pdl to support column-level lineage? For reference, the current Upstream.pdl looks like this (code comments removed for brevity):
import com.linkedin.common.DatasetUrn

record Upstream {
  auditStamp: AuditStamp
  dataset: DatasetUrn
  type: DatasetLineageType
}
  2. How can we provide a sample script (Python-like) that end users can use to parse their SQL statements easily and emit MCE messages for DataHub to pick up?
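
As a rough illustration of the second question, the sketch below extracts column-level mappings from one very restricted statement shape: `INSERT INTO target (cols) SELECT alias.col, ... FROM table alias, ... [WHERE ...]`. It is a toy built on regular expressions, not a real SQL parser, and it does not produce MCE messages, since that format depends on the answer to the first question. The function name `parse_column_lineage` and the table and column names in the usage note are assumptions made for this example.

```python
import re
from typing import List, Tuple

Column = Tuple[str, str]  # (table, column)

# Matches only the restricted INSERT ... SELECT shape described above.
_INSERT_SELECT = re.compile(
    r"INSERT\s+INTO\s+(?P<target>\w+)\s*\((?P<cols>[^)]*)\)\s*"
    r"SELECT\s+(?P<exprs>.+?)\s+FROM\s+(?P<sources>.+?)"
    r"(?:\s+WHERE\s.*)?;?\s*$",
    re.IGNORECASE | re.DOTALL,
)


def parse_column_lineage(sql: str) -> List[Tuple[Column, Column]]:
    """Return (downstream column, upstream column) pairs for one statement."""
    match = _INSERT_SELECT.match(sql.strip())
    if match is None:
        raise ValueError("unsupported statement shape")

    target = match.group("target")
    target_cols = [c.strip() for c in match.group("cols").split(",")]
    exprs = [e.strip() for e in match.group("exprs").split(",")]
    if len(target_cols) != len(exprs):
        raise ValueError("target column count does not match SELECT list")

    # Map each alias (or bare table name) in the FROM list to its table.
    aliases = {}
    for item in match.group("sources").split(","):
        parts = item.split()
        if not parts:
            raise ValueError("empty entry in FROM list")
        aliases[parts[-1]] = parts[0]

    lineage = []
    for col, expr in zip(target_cols, exprs):
        alias, _, source_col = expr.partition(".")
        if not source_col or alias not in aliases:
            raise ValueError(f"cannot resolve expression: {expr!r}")
        lineage.append(((target, col), (aliases[alias], source_col)))
    return lineage
```

For example, given `INSERT INTO order_summary (oid, cl) SELECT o.oid, c.cl FROM orders o, customers c WHERE o.cid = c.cid` (a made-up statement), the function maps `order_summary.oid` to `orders.oid` and `order_summary.cl` to `customers.cl`. Expressions, joins with `ON` clauses, subqueries, and `SELECT *` are out of scope here and would need a proper SQL parser.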

To be continued

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Reactions: 13
  • Comments: 21 (21 by maintainers)

Top GitHub Comments

3 reactions
liangjun-jiang commented, Aug 5, 2020

Absolutely, @jplaisted. This issue was created before the RFC process was adopted. Happy to convert this into an RFC for future reference. Also happy to split it into multiple PRs. Realistically, I don’t expect this PR to get merged, since I did some hacks.

2 reactions
jplaisted commented, Aug 4, 2020

So I think this issue has a lot of really great ideas in it, but it is starting to get a little large and hard to follow. Jumping right from here to a large PR isn’t that easy either 😃

Can we maybe try the full RFC process here? i.e. a design doc? That should be easier to follow than this issue (the latest state of the RFC PR is the current proposal, no need to read a large back and forth discussion if you want to jump right in), and we can review that, and then after that is ok’d we can start code reviews.

I would also strongly suggest multiple PRs; try to make them smaller. A good example is the first PR should probably be models only, no code changes. Then you can start adding code support.

Let me know what you think, thanks!
