
Schema and statistics in Transform TFX Pipeline Component


The main entry point of the Transform component (preprocessing_fn) should also receive the computed statistics and schema alongside the inputs. In some scenarios, users might want to use the statistics, e.g. to eliminate unnecessary features.
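As a rough illustration of that use case (eliminating features based on computed statistics), here is a minimal sketch in plain Python. The dict of per-feature missing rates is a stand-in for the real TFX statistics proto, and the function name and threshold are hypothetical, chosen only for the example:

```python
from typing import Any, Dict


def drop_mostly_missing(
    inputs: Dict[str, Any],
    missing_rate: Dict[str, float],  # feature name -> fraction missing (stand-in for real stats)
    threshold: float = 0.9,
) -> Dict[str, Any]:
    """Remove features whose computed missing-rate exceeds the threshold.

    This is the kind of stats-driven decision a preprocessing_fn could make
    if the component passed the computed statistics in.
    """
    return {
        name: value
        for name, value in inputs.items()
        if missing_rate.get(name, 0.0) <= threshold
    }
```

With access to real statistics, the same pattern could key off variance, cardinality, or any other computed summary instead of the missing rate.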

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Reactions: 1
  • Comments: 5 (1 by maintainers)

Top GitHub Comments

3 reactions
JoshLipschultz commented, Jan 28, 2021

I’d like to re-up this request—having easier access to the metadata and schema from within the Transform callback would be tremendously useful. Some feature transformations could depend on the inferred schema. Is there some other workaround?

0 reactions
wsuchy commented, Dec 3, 2019

@zoyahav Indeed. Currently, in the TFX Transform pipeline component, the preprocessing_fn has a signature of

preprocessing_fn(inputs: Dict[str, tf.Tensor]) -> Dict[str, tf.Tensor]

and I am proposing to change it as follows:

preprocessing_fn(inputs: Dict[str, tf.Tensor], metadata: DatasetMetadata, stats: DatasetFeatureStatisticsList) -> Dict[str, tf.Tensor]

or, even better, as a higher-order function / factory:

preprocessing_fn_factory(metadata: DatasetMetadata, stats: DatasetFeatureStatisticsList)
-> Callable[[Dict[str, tf.Tensor]], Dict[str, tf.Tensor]]

My preprocessing function could then depend on the stats and schema. I know I could reach for the stats and schema manually, but that requires talking to the SQLite metadata database, keeping track of artifact paths, etc.
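The proposed factory shape can be sketched in plain Python. Everything below is illustrative: dicts stand in for DatasetMetadata and DatasetFeatureStatisticsList, plain values stand in for tf.Tensor, and the feature-dropping rules are invented for the example, not part of any TFX API:

```python
from typing import Any, Callable, Dict

# Stand-in for Dict[str, tf.Tensor]; plain values keep the sketch runnable
# without TFX installed.
TensorDict = Dict[str, Any]


def preprocessing_fn_factory(
    metadata: Dict[str, str],            # feature name -> dtype (stand-in for DatasetMetadata)
    stats: Dict[str, Dict[str, float]],  # feature name -> summary stats (stand-in)
) -> Callable[[TensorDict], TensorDict]:
    """Return a preprocessing_fn that closes over the schema and statistics."""

    def preprocessing_fn(inputs: TensorDict) -> TensorDict:
        outputs: TensorDict = {}
        for name, value in inputs.items():
            # Drop constant features: zero variance means no information.
            if stats.get(name, {}).get("std_dev", 1.0) == 0.0:
                continue
            # Keep only features the schema actually declares.
            if name in metadata:
                outputs[name] = value
        return outputs

    return preprocessing_fn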


Top Results From Across the Web

The Transform TFX Pipeline Component - TensorFlow
The Transform TFX pipeline component performs feature engineering on tf.Examples emitted from an ExampleGen component, using a data schema created by a ...
TFX standard data components - Introduction to TFX Pipelines
This allows your pipeline to scale data set statistical summaries as your data grows, with built-in logging, and fault tolerance for debugging.
TFX Components Walk-through - | notebook.community
The Transform component performs data transformation and feature engineering. The Transform component consumes tf.Examples emitted from the ExampleGen component ...
How to use the tfx.components.base.executor_spec ... - Snyk
Performs anomaly detection based on statistics and data schema. ... In a typical TFX pipeline, the SchemaGen component generates a schema which is...
https://raw.githubusercontent.com/kubeflow/pipelin...
The Transform component wraps TensorFlow Transform (tf.Transform) to preprocess data in a TFX pipeline. This component will load the preprocessing_fn from ...
