How to set the size of train-eval split with CsvExampleGen?
I have been working with a very simple pipeline that loads the iris dataset and generates some statistics about it. When I run the pipeline, even without specifying a train-eval split anywhere, the folders eval and train are created under the CsvExampleGen pipeline folder with tfrecords inside them, and an apparently predefined split is applied (roughly 100 training examples and 50 evaluation examples). My question is: where can I choose whether to split at all, and where can I set the size of the split?
The pipeline code, run in Airflow, is below:
import os
import logging
import datetime
from tfx.orchestration.airflow.airflow_runner import AirflowDAGRunner
from tfx.orchestration.pipeline import PipelineDecorator
from tfx.utils.dsl_utils import csv_input
from tfx.components.example_gen.csv_example_gen.component import CsvExampleGen
from tfx.components.statistics_gen.component import StatisticsGen
from tfx.orchestration.tfx_runner import TfxRunner
_CASE_FOLDER = os.path.join(os.environ['HOME'], 'cases', 'iris')
_DATA_FOLDER = os.path.join(_CASE_FOLDER, 'data')
_PIPELINE_ROOT_FOLDER = os.path.join(_CASE_FOLDER, 'pipelines')
_METADATA_DB_ROOT_FOLDER = os.path.join(_CASE_FOLDER, 'metadata')
_LOG_ROOT_FOLDER = os.path.join(_CASE_FOLDER, 'logs')
@PipelineDecorator(
    pipeline_name='test_tfx_pipeline_iris',
    pipeline_root=_PIPELINE_ROOT_FOLDER,
    metadata_db_root=_METADATA_DB_ROOT_FOLDER,
    additional_pipeline_args={'logger_args': {
        'log_root': _LOG_ROOT_FOLDER,
        'log_level': logging.INFO
    }}
)
def create_pipeline():
    print("HELLO")
    examples = csv_input(_DATA_FOLDER)
    example_gen = CsvExampleGen(input_base=examples, name='iris_example_gen_1')
    #ingests this examples thing, and returns tf.Example records
    statistics_gen = StatisticsGen(input_data=example_gen.outputs.examples)
    return [
        example_gen, statistics_gen
    ]
_airflow_config = {
'schedule_interval': None,
'start_date': datetime.datetime(2019, 1, 1),
}
pipeline = AirflowDAGRunner(_airflow_config).run(create_pipeline())
The folder structure being generated:
If you’re comfortable modifying your copy of TFX, you can change the example_gen executor directly. Be warned that this makes all of your pipelines use the same ratio. It’s not a great experience, and we’re working to elevate this parameter into the pipeline config. To unblock you in case you really want a 9:1 train:eval ratio, make the following change.
(if you are working with a GitHub clone):
tfx/components/example_gen/base_example_gen_executor.py:37
return 1 if int(hashlib.sha256(record).hexdigest(), 16) % 10 == 0 else 0
(if you are using 0.12.0 downloaded from PyPI):
tfx/components/example_gen/csv_example_gen/executor.py:39
return 1 if int(hashlib.sha256(record).hexdigest(), 16) % 10 == 0 else 0
You’ll have to make the change again every time the file gets overwritten (e.g. when upgrading to 0.13.0), so waiting for the pipeline config parameter is definitely recommended.
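To see what the modulus in that one-line change does to the ratio, the hash-based rule can be tried outside TFX. This is only a sketch under assumptions: the names `partition_fn`, `split_counts` and `eval_modulus` are hypothetical and not part of TFX, partition 0 is assumed to be train and partition 1 eval, and the synthetic `row-N` records stand in for real serialized examples.

```python
import hashlib


def partition_fn(record: bytes, eval_modulus: int) -> int:
    """Return 1 (assumed: eval) if the record's SHA-256 hash is divisible
    by eval_modulus, otherwise 0 (assumed: train)."""
    return 1 if int(hashlib.sha256(record).hexdigest(), 16) % eval_modulus == 0 else 0


def split_counts(records, eval_modulus):
    """Count how many records land in each partition: [train, eval]."""
    counts = [0, 0]
    for record in records:
        counts[partition_fn(record, eval_modulus)] += 1
    return counts


# Synthetic stand-ins for serialized examples (hypothetical data).
records = [f"row-{i}".encode("utf-8") for i in range(10000)]

train_3, eval_3 = split_counts(records, 3)     # modulus 3: roughly 2:1
train_10, eval_10 = split_counts(records, 10)  # modulus 10: roughly 9:1

print("modulus 3  -> train/eval:", train_3, eval_3)
print("modulus 10 -> train/eval:", train_10, eval_10)
```

Because the partition depends only on the hash of the record, the same record always lands in the same split across runs, and the ratio is approximate rather than exact, which fits the "around 100 and 50" counts seen on a 150-row dataset.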
To be more specific: in the long term, we will support pre-split input, a custom ratio, and probably a custom split function as well.