
Schema and statistics in Transform TFX Pipeline Component


The main entry point of the Transform component (preprocessing_fn) should also receive the computed statistics and schema alongside the inputs. In some scenarios, users might want to use the statistics, e.g. to eliminate unnecessary features.
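As a rough illustration of that use case (eliminating features based on computed statistics), here is a minimal sketch in plain Python. The dict of per-feature missing rates is a stand-in for the real TFX statistics proto, and the function name and threshold are hypothetical, chosen only for the example:

```python
from typing import Any, Dict


def drop_mostly_missing(
    inputs: Dict[str, Any],
    missing_rate: Dict[str, float],  # feature name -> fraction missing (stand-in for real stats)
    threshold: float = 0.9,
) -> Dict[str, Any]:
    """Remove features whose computed missing-rate exceeds the threshold.

    This is the kind of stats-driven decision a preprocessing_fn could make
    if the component passed the computed statistics in.
    """
    return {
        name: value
        for name, value in inputs.items()
        if missing_rate.get(name, 0.0) <= threshold
    }
```

With access to real statistics, the same pattern could key off variance, cardinality, or any other computed summary instead of the missing rate.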

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Reactions: 1
  • Comments: 5 (1 by maintainers)

Top GitHub Comments

3 reactions
JoshLipschultz commented, Jan 28, 2021

I’d like to re-up this request—having easier access to the metadata and schema from within the Transform callback would be tremendously useful. Some feature transformations could depend on the inferred schema. Is there some other workaround?

0 reactions
wsuchy commented, Dec 3, 2019

@zoyahav Indeed. Currently, in the TFX Transform pipeline component, the preprocessing_fn has a signature of

preprocessing_fn(inputs: Dict[str, tf.Tensor]) -> Dict[str, tf.Tensor]

and I am proposing to change it as follows:

preprocessing_fn(inputs: Dict[str, tf.Tensor], metadata: DatasetMetadata, stats: DatasetFeatureStatisticsList) -> Dict[str, tf.Tensor]

or, even better, as a higher-order function / factory:

preprocessing_fn_factory(metadata: DatasetMetadata, stats: DatasetFeatureStatisticsList)
-> Callable[[Dict[str, tf.Tensor]], Dict[str, tf.Tensor]]

My preprocessing function could then depend on the stats and schema. I know I could reach for the stats and schema manually, but that requires talking to the SQLite metadata database, keeping track of artifact paths, etc.
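The proposed factory shape can be sketched in plain Python. Everything below is illustrative: dicts stand in for DatasetMetadata and DatasetFeatureStatisticsList, plain values stand in for tf.Tensor, and the feature-dropping rules are invented for the example, not part of any TFX API:

```python
from typing import Any, Callable, Dict

# Stand-in for Dict[str, tf.Tensor]; plain values keep the sketch runnable
# without TFX installed.
TensorDict = Dict[str, Any]


def preprocessing_fn_factory(
    metadata: Dict[str, str],            # feature name -> dtype (stand-in for DatasetMetadata)
    stats: Dict[str, Dict[str, float]],  # feature name -> summary stats (stand-in)
) -> Callable[[TensorDict], TensorDict]:
    """Return a preprocessing_fn that closes over the schema and statistics."""

    def preprocessing_fn(inputs: TensorDict) -> TensorDict:
        outputs: TensorDict = {}
        for name, value in inputs.items():
            # Drop constant features: zero variance means no information.
            if stats.get(name, {}).get("std_dev", 1.0) == 0.0:
                continue
            # Keep only features the schema actually declares.
            if name in metadata:
                outputs[name] = value
        return outputs

    return preprocessing_fn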


Top Results From Across the Web

The Transform TFX Pipeline Component - TensorFlow
The Transform TFX pipeline component performs feature engineering on tf.Examples emitted from an ExampleGen component, using a data schema created by a ...
TFX standard data components - Introduction to TFX Pipelines
This allows your pipeline to scale data set statistical summaries as your data grows, with built-in logging, and fault tolerance for debugging.
TFX Components Walk-through - | notebook.community
The Transform component performs data transformation and feature engineering. The Transform component consumes tf.Examples emitted from the ExampleGen component ...
How to use the tfx.components.base.executor_spec ... - Snyk
Performs anomaly detection based on statistics and data schema. ... In a typical TFX pipeline, the SchemaGen component generates a schema which is...
https://raw.githubusercontent.com/kubeflow/pipelin...
The Transform component wraps TensorFlow Transform (tf.Transform) to preprocess data in a TFX pipeline. This component will load the preprocessing_fn from ...
