Shuffling in ExampleGen should be optional
The docs mention that ExampleGen “shuffles the dataset for ML best practice”. However, if the use case is a time-series problem using sliding windows, shuffling before splitting into train and eval sets is counterproductive, as I’d need a coherent (ordered) training set.
To accomplish this for now (as I understand it), one would have to create an entire custom ExampleGen by modifying base_example_gen_executor and removing `'Shuffle' >> beam.transforms.Reshuffle()`.
It would be great if this weren’t necessary and the shuffling in ExampleGen could be switched off directly when calling `example_gen = CsvExampleGen(input=examples)`, e.g. via a `shuffle=False` argument.
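To make the motivation concrete, here is a minimal pure-Python sketch (not TFX code; `sliding_windows` is a hypothetical helper) of why sliding-window training data needs the original order:

```python
def sliding_windows(series, size, shift=1):
    """Slice an ordered series into overlapping [i, i + size) windows."""
    return [series[i:i + size] for i in range(0, len(series) - size + 1, shift)]

ordered = [10, 11, 12, 13, 14]
print(sliding_windows(ordered, size=3))
# [[10, 11, 12], [11, 12, 13], [12, 13, 14]] -- each window is a coherent run.
# If ExampleGen shuffles first (e.g. to [13, 10, 14, 11, 12]), the same
# slicing mixes non-adjacent timesteps and the temporal structure is lost.
```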
Issue Analytics
- State:
- Created: 3 years ago
- Reactions: 7
- Comments: 14 (2 by maintainers)
Top GitHub Comments
It is certainly a restrictive practice, but it may sometimes be necessary; it depends on the system as a whole.
Some more notes that can help define the approach: Apache Beam does not guarantee order, as mentioned by @1025KB. ExampleGen will read the files in random order (many threads can be reading in parallel). So even if the files that the examples are stored in have names that would list in sequential order, the read would not be sequential. And if the files are splittable, like uncompressed CSV files, then even a single file’s read can be done by many threads.
It’s potentially possible to write custom ExampleGens. One version could use a Beam pipeline to read all the data, use the elements’ timestamps, and then create sliding windows from that data, with the window parameters passed to the ExampleGen at runtime.
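A rough pure-Python sketch of that idea, mirroring the semantics of Beam's `SlidingWindows(size, period)` (the real version would be a `beam.WindowInto` step inside the custom ExampleGen; `assign_sliding_windows` is a hypothetical helper):

```python
from collections import defaultdict

def assign_sliding_windows(events, size, period):
    """Assign (timestamp, value) events to overlapping windows.

    Mirrors Beam's SlidingWindows(size, period) with zero offset: an event
    at time t falls into every window [start, start + size) whose start is
    a multiple of period and satisfies t - size < start <= t.
    """
    windows = defaultdict(list)
    for t, value in sorted(events):
        start = t - (t % period)          # latest window start covering t
        while start > t - size:
            windows[start].append(value)
            start -= period
    return dict(sorted(windows.items()))

events = [(0, "a"), (1, "b"), (2, "c"), (3, "d")]
print(assign_sliding_windows(events, size=2, period=1))
# {-1: ['a'], 0: ['a', 'b'], 1: ['b', 'c'], 2: ['c', 'd'], 3: ['d']}
```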
Another option, which I have not had time to explore yet, would be to land fixed-length sequences (not sliding) and then use Beam to process the metadata and create a processing map with start/end offsets for all the sequence windows, then, through GBK left/right combos, create the sliding windows from the fixed-length sequences. SequenceExample would be a good candidate for storage here, as the context could provide the metadata needed for the first phase of the pipeline. But this is more complex than the first option and may not actually gain much in terms of processing time.
Another consideration: at its core, one component will need to create an ordered sequence at some point. In a streaming prediction use case, where inference is done in real time from a streaming source, the inference system needs to create the [timestep, feature] shape anyway, so having that same system also output its values directly to a bucket ready for ExampleGen can make sense, since the processing is being done already. However, as pointed out, the downside is that the amount of storage used increases significantly, essentially by the length of the sliding window * the offset of the slide. A mitigation, though not valid for every use case, is to downsample the data before adding it to the sliding window, for example by creating fixed-window First/Last/Max/Min objects, which are then used within the sliding window to give objects of shape [[First/Last/Max/Min], [First/Last/Max/Min], …].
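The downsampling mitigation could be sketched like this (pure Python; `fixed_window_aggregates` is a hypothetical helper, and the tuple stands in for the First/Last/Max/Min objects):

```python
def fixed_window_aggregates(series, width):
    """Downsample: one (first, last, max, min) tuple per fixed, non-overlapping window."""
    return [
        (w[0], w[-1], max(w), min(w))
        for w in (series[i:i + width] for i in range(0, len(series) - width + 1, width))
    ]

def sliding_over_aggregates(aggregates, size, shift=1):
    """Build the sliding windows over the downsampled objects instead of raw points."""
    return [aggregates[i:i + size] for i in range(0, len(aggregates) - size + 1, shift)]

raw = [3, 1, 4, 1, 5, 9, 2, 6]
aggs = fixed_window_aggregates(raw, width=2)
print(aggs)
# [(3, 1, 3, 1), (4, 1, 4, 1), (5, 9, 9, 5), (2, 6, 6, 2)]
print(sliding_over_aggregates(aggs, size=2))
# 3 windows of 2 aggregate objects each, instead of 6 windows of 4 raw points
```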
I hope to be able to explore the SequenceExample option with a custom ExampleGen in December.
Creating the windows beforehand is a very bad practice, which also inflates the size of the dataset by a huge margin. We can just use tf.data.Dataset for that. See the issue mentioned above for more info.
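What the comment points at is the standard `tf.data.Dataset.window` idiom, which builds the sliding windows lazily at training time instead of materializing duplicated rows (a minimal sketch, assuming TensorFlow 2.x):

```python
import tensorflow as tf

size, shift = 3, 1
ds = tf.data.Dataset.range(5)
# window() yields a dataset of sub-datasets; flat_map + batch turns each
# window into a dense [size] tensor without duplicating data on disk.
windows = ds.window(size, shift=shift, drop_remainder=True)
windows = windows.flat_map(lambda w: w.batch(size))
print([w.numpy().tolist() for w in windows])
# [[0, 1, 2], [1, 2, 3], [2, 3, 4]]
```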
A similar way to avoid unnecessary data duplication was used in the `materialize=False` parameter of the Transform component.