
API for attaching user metadata to the execution plan and event

See original GitHub issue

Feature

Allow passing custom key-value pairs from a Spark job so that they are sent along with the lineage data, either in the ExecutionPlan or the ExecutionEvent. This would be a powerful feature, letting users attach their own metadata to the lineage. I am not sure whether this feature already exists: ExecutionPlan has a property extraInfo: Map[String, Any] = Map.empty that looks like it may be intended for this purpose.

Background

The current immediate requirement is to have JobId and RunId passed as part of lineage data.

JobId: essentially a unique name for the notebook that runs as a job. On Azure Databricks the applicationName and applicationId are autogenerated; they are cluster-specific, not "job"-specific.

RunId: unique per run of a job. If a job contains two write operations, two ExecutionPlans are generated, and there is no way I can see to tell whether the two ExecutionPlans come from the same job running once (two writes) or from the job running twice (a single write each).
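
To make the requirement concrete, here is a sketch of how both values could be attached using the provider API the maintainer proposes later in this thread. The type names (NoopUserExtraMetaDataProvider, ExecutionPlan, HarvestingContext) come from that example; Spline imports are omitted, and sourcing the IDs from custom Spark conf keys is purely a hypothetical convention:

```scala
// Sketch only: reading the IDs from custom Spark conf keys ("my.jobId",
// "my.runId") is an assumption, not part of Spline. NoopUserExtraMetaDataProvider
// lets us override only the hook we need.
class JobRunIdProvider(sparkConf: org.apache.spark.SparkConf)
    extends NoopUserExtraMetaDataProvider {

  override def forExecPlan(plan: ExecutionPlan, ctx: HarvestingContext): Map[String, Any] =
    Map(
      "jobId" -> sparkConf.get("my.jobId", "unknown"), // stable notebook/job name
      "runId" -> sparkConf.get("my.runId", "unknown")  // unique per run, shared by all writes in it
    )
}
```

With this, two ExecutionPlans sharing the same runId would be distinguishable from the same job having run twice.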

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

1 reaction
wajda commented, May 17, 2020

New API example:

spark.enableLineageTracking(new DefaultSplineConfigurer(conf) {
  override protected def userExtraMetadataProvider = new UserExtraMetaDataProvider {
    override def forExecEvent(event: ExecutionEvent, ctx: HarvestingContext): Map[String, Any] = Map("foo" -> "bar")
    override def forExecPlan(plan: ExecutionPlan, ctx: HarvestingContext): Map[String, Any] = Map("foo" -> "bar")
    override def forOperation(op: ReadOperation, ctx: HarvestingContext): Map[String, Any] = Map("foo" -> "bar")
    override def forOperation(op: WriteOperation, ctx: HarvestingContext): Map[String, Any] = Map("foo" -> "bar")
    override def forOperation(op: DataOperation, ctx: HarvestingContext): Map[String, Any] = Map("foo" -> "bar")
  }
})

There is also a NoopUserExtraMetaDataProvider class with all forXXXX() methods returning an empty Map. You can extend that class and override only the methods you need.

For codeless mode, the following property can be used to instantiate a custom UserExtraMetaDataProvider:

spline.user_extra_meta_provider.className=com.my.FooBarExtraMetaDataProvider
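
A minimal sketch of what such a class might look like, extending the NoopUserExtraMetaDataProvider mentioned above; Spline imports are omitted, and the assumption that codeless mode instantiates the class reflectively through a public no-arg constructor is not confirmed in this thread:

```scala
package com.my

// Sketch: overrides only the event hook; all other forXXXX() methods inherit
// the empty-Map behaviour from NoopUserExtraMetaDataProvider. A public no-arg
// constructor is assumed for instantiation via the className property.
class FooBarExtraMetaDataProvider extends NoopUserExtraMetaDataProvider {
  override def forExecEvent(event: ExecutionEvent, ctx: HarvestingContext): Map[String, Any] =
    Map("foo" -> "bar")
}
```
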
1 reaction
ankitbko commented, Mar 12, 2020

Just updating… got it working. I had to implement StandardSplineConfigurationStack myself, as it is not present in 0.4 and the same logic sits in a private function.

For additional flexibility, I also read a single property from the SparkConf containing semicolon-separated keys, which are then resolved and passed along with the ExecutionPlan.
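
That workaround might look roughly like this; the property name, and the idea of resolving each listed key against the Spark conf, are assumptions based on the comment above (Spline imports omitted):

```scala
// Sketch of the described workaround: one Spark conf property holds a
// semicolon-separated list of keys; each key is looked up in the conf and
// forwarded as execution-plan metadata. Property names are hypothetical.
class ConfKeysMetadataProvider(sparkConf: org.apache.spark.SparkConf)
    extends NoopUserExtraMetaDataProvider {

  override def forExecPlan(plan: ExecutionPlan, ctx: HarvestingContext): Map[String, Any] =
    sparkConf.get("spline.extra.confKeys", "")       // e.g. "my.jobId;my.runId"
      .split(';')
      .map(_.trim)
      .filter(_.nonEmpty)
      .flatMap(key => sparkConf.getOption(key).map(value => key -> value))
      .toMap
}
```

Keys missing from the conf are simply skipped rather than emitted with empty values.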

Read more comments on GitHub
