Shuffling in ExampleGen should be optional
The docs mention that ExampleGen “shuffles the dataset for ML best practice”. However, if the use case is a time-series problem using sliding windows, shuffling before splitting into train and eval sets is counterproductive, as I’d need a coherent (ordered) training set.
To accomplish this for now (as I understand it), one would have to create an entire custom ExampleGen by modifying base_example_gen_executor and removing `'Shuffle' >> beam.transforms.Reshuffle()`.
It would be great if this weren’t necessary and the shuffling in ExampleGen could be switched off directly when calling `example_gen = CsvExampleGen(input=examples)`, e.g. via a `shuffle=False` argument.
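To make the motivation concrete, here is a minimal pure-Python sketch (not TFX code; `sliding_windows` is a hypothetical helper) of why sliding-window training data needs the original order:

```python
def sliding_windows(series, size, shift=1):
    """Slice an ordered series into overlapping [i, i + size) windows."""
    return [series[i:i + size] for i in range(0, len(series) - size + 1, shift)]

ordered = [10, 11, 12, 13, 14]
print(sliding_windows(ordered, size=3))
# [[10, 11, 12], [11, 12, 13], [12, 13, 14]] -- each window is a coherent run.
# If ExampleGen shuffles first (e.g. to [13, 10, 14, 11, 12]), the same
# slicing mixes non-adjacent timesteps and the temporal structure is lost.
```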
Issue Analytics
- State:
- Created: 3 years ago
- Reactions: 7
- Comments: 14 (2 by maintainers)
Top GitHub Comments
It is certainly a restrictive practice, but it may sometimes be necessary; it depends on the system as a whole.
Some more notes that can help define the approach: Apache Beam does not guarantee order, as mentioned by @1025KB. ExampleGen will read the files in random order (many threads can be reading in parallel). So even if the files that the examples are stored in have names that would list in sequential order, the read would not be sequential. And if the files are splittable, like uncompressed CSV files, then even a single file’s read can be done by many threads.
It’s potentially possible to write custom ExampleGens. One version could use a Beam pipeline to read all the data, use the elements’ timestamps, and then create sliding windows from that data, with the window parameters passed to the ExampleGen at runtime.
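A rough pure-Python sketch of that idea, mirroring the semantics of Beam's `SlidingWindows(size, period)` (the real version would be a `beam.WindowInto` step inside the custom ExampleGen; `assign_sliding_windows` is a hypothetical helper):

```python
from collections import defaultdict

def assign_sliding_windows(events, size, period):
    """Assign (timestamp, value) events to overlapping windows.

    Mirrors Beam's SlidingWindows(size, period) with zero offset: an event
    at time t falls into every window [start, start + size) whose start is
    a multiple of period and satisfies t - size < start <= t.
    """
    windows = defaultdict(list)
    for t, value in sorted(events):
        start = t - (t % period)          # latest window start covering t
        while start > t - size:
            windows[start].append(value)
            start -= period
    return dict(sorted(windows.items()))

events = [(0, "a"), (1, "b"), (2, "c"), (3, "d")]
print(assign_sliding_windows(events, size=2, period=1))
# {-1: ['a'], 0: ['a', 'b'], 1: ['b', 'c'], 2: ['c', 'd'], 3: ['d']}
```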
Another option, which I have not had time to explore yet, would be to land fixed-length sequences (not sliding) and then use Beam to process the metadata and create a processing map with start/end offsets for all the sequence windows, then, through GBK left/right combos, create the sliding windows from the fixed-length sequences. SequenceExample would be a good candidate for storage here, as the context could provide the metadata needed for the first phase of the pipeline. But this is more complex than the first option and may not actually gain much in terms of processing time.
Another consideration: at its core, one component will need to create an ordered sequence at some point. In a streaming prediction use case, where inference is done in real time from a streaming source, the inference system needs to create the [timestep, feature] shape anyway, so having that same system also output its values directly to a bucket ready for ExampleGen can make sense, since the processing is being done already. However, as pointed out, the downside is that the amount of storage used increases significantly, essentially by the length of the sliding window * the offset of the slide. A mitigation, though not valid for every use case, is to downsample the data before adding it to the sliding window, for example by creating fixed-window First/Last/Max/Min objects, which are then used within the sliding window to give objects of shape [[First/Last/Max/Min], [First/Last/Max/Min], …].
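The downsampling mitigation could be sketched like this (pure Python; `fixed_window_aggregates` is a hypothetical helper, and the tuple stands in for the First/Last/Max/Min objects):

```python
def fixed_window_aggregates(series, width):
    """Downsample: one (first, last, max, min) tuple per fixed, non-overlapping window."""
    return [
        (w[0], w[-1], max(w), min(w))
        for w in (series[i:i + width] for i in range(0, len(series) - width + 1, width))
    ]

def sliding_over_aggregates(aggregates, size, shift=1):
    """Build the sliding windows over the downsampled objects instead of raw points."""
    return [aggregates[i:i + size] for i in range(0, len(aggregates) - size + 1, shift)]

raw = [3, 1, 4, 1, 5, 9, 2, 6]
aggs = fixed_window_aggregates(raw, width=2)
print(aggs)
# [(3, 1, 3, 1), (4, 1, 4, 1), (5, 9, 9, 5), (2, 6, 6, 2)]
print(sliding_over_aggregates(aggs, size=2))
# 3 windows of 2 aggregate objects each, instead of 6 windows of 4 raw points
```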
I hope to be able to explore the SequenceExample option with a custom ExampleGen in December.
Creating the windows beforehand is a very bad practice, which also inflates the size of the dataset by a huge margin. We can just use tf.data.Dataset for that. See the issue mentioned above for more info.
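What the comment points at is the standard `tf.data.Dataset.window` idiom, which builds the sliding windows lazily at training time instead of materializing duplicated rows (a minimal sketch, assuming TensorFlow 2.x):

```python
import tensorflow as tf

size, shift = 3, 1
ds = tf.data.Dataset.range(5)
# window() yields a dataset of sub-datasets; flat_map + batch turns each
# window into a dense [size] tensor without duplicating data on disk.
windows = ds.window(size, shift=shift, drop_remainder=True)
windows = windows.flat_map(lambda w: w.batch(size))
print([w.numpy().tolist() for w in windows])
# [[0, 1, 2], [1, 2, 3], [2, 3, 4]]
```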
A similar way to avoid unnecessary data duplication was used in the `materialize=False` parameter of the Transform component.