How to set the size of train-eval split with CsvExampleGen?

I have been working with a very simple pipeline that loads the iris dataset and generates some statistics about it. When I run the pipeline, without specifying a train-eval split anywhere, the folders eval and train are created under the CsvExampleGen pipeline folder with TFRecord files inside them, and an apparently predefined split is applied (roughly 100 training examples and 50 evaluation examples). My question is: where can I choose whether to split at all, and where can I set the size of the split?

Pipeline code below, run in Airflow:

import os
import logging
import datetime
from tfx.orchestration.airflow.airflow_runner import AirflowDAGRunner
from tfx.orchestration.pipeline import PipelineDecorator

from tfx.utils.dsl_utils import csv_input
from tfx.components.example_gen.csv_example_gen.component import CsvExampleGen
from tfx.components.statistics_gen.component import StatisticsGen
from tfx.orchestration.tfx_runner import TfxRunner


_CASE_FOLDER = os.path.join(os.environ['HOME'], 'cases', 'iris')
_DATA_FOLDER = os.path.join(_CASE_FOLDER, 'data')
_PIPELINE_ROOT_FOLDER = os.path.join(_CASE_FOLDER, 'pipelines')
_METADATA_DB_ROOT_FOLDER = os.path.join(_CASE_FOLDER, 'metadata')
_LOG_ROOT_FOLDER = os.path.join(_CASE_FOLDER, 'logs')


@PipelineDecorator(
    pipeline_name='test_tfx_pipeline_iris',
    pipeline_root=_PIPELINE_ROOT_FOLDER,
    metadata_db_root=_METADATA_DB_ROOT_FOLDER,
    additional_pipeline_args={'logger_args': {
        'log_root': _LOG_ROOT_FOLDER,
        'log_level': logging.INFO
    }}
)
def create_pipeline():

    print("HELLO")
    examples = csv_input(_DATA_FOLDER)

    example_gen = CsvExampleGen(input_base=examples, name='iris_example_gen_1')
    # Ingests the CSV files under _DATA_FOLDER and emits tf.Example records.

    statistics_gen = StatisticsGen(input_data=example_gen.outputs.examples)

    return [
        example_gen, statistics_gen
    ]

_airflow_config = {
    'schedule_interval': None,
    'start_date': datetime.datetime(2019, 1, 1),
}
pipeline = AirflowDAGRunner(_airflow_config).run(create_pipeline())

The folder structure being generated:

[image: generated folder structure]

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 9 (2 by maintainers)

Top GitHub Comments

1 reaction
krazyhaas commented, Mar 22, 2019

If you’re comfortable modifying your version of TFX, you can change the example_gen executor directly. This will cause all of your pipelines to use the same ratio, so buyer beware! It’s not a great experience and we’re working to elevate this parameter into the pipeline, but to unblock you in case you really want to change the ratio to 9:1 train:eval, make the following change.

If you are working with a GitHub clone, at tfx/components/example_gen/base_example_gen_executor.py:37:

    return 1 if int(hashlib.sha256(record).hexdigest(), 16) % 10 == 0 else 0

If you are using 0.12.0 downloaded from PyPI, at tfx/components/example_gen/csv_example_gen/executor.py:39:

    return 1 if int(hashlib.sha256(record).hexdigest(), 16) % 10 == 0 else 0

You’ll have to reapply the change every time the file gets overwritten (e.g. when upgrading to 0.13.0), so waiting for the pipeline config parameter is definitely recommended.
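
To see why changing the modulus changes the ratio, here is a minimal standalone sketch of the hash-based partitioning expression quoted above. The function name `partition_index`, the `modulus` parameter, and the synthetic records are assumptions made for illustration; they are not part of TFX. Index 1 stands for eval and 0 for train, matching the 9:1 train:eval description in the comment.

```python
import hashlib


def partition_index(record: bytes, modulus: int = 10) -> int:
    """Return 1 (eval) when the record's SHA-256 hash is divisible by modulus, else 0 (train)."""
    return 1 if int(hashlib.sha256(record).hexdigest(), 16) % modulus == 0 else 0


def split_counts(records, modulus: int = 10):
    """Count how many records land in [train, eval] for a given modulus."""
    counts = [0, 0]
    for record in records:
        counts[partition_index(record, modulus)] += 1
    return counts


# Synthetic records, used only to observe the approximate ratio.
records = [f"row-{i}".encode("utf-8") for i in range(1000)]
train, eval_ = split_counts(records, modulus=10)
print(f"train={train} eval={eval_}")
```

Because the assignment depends only on the record's bytes, the same record lands in the same split on every run. A modulus of 3 would give roughly 2:1, which is consistent with the 100/50 split reported in the question, though that default is an inference here and not something the thread states.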

1 reaction
1025KB commented, Mar 22, 2019

To be more specific, in the long term we will support pre-split input, a custom ratio, and probably also a custom split function.

Top Results From Across the Web

The ExampleGen TFX Pipeline Component - TensorFlow
To customize the train/eval split ratio which ExampleGen will output, set the output_config for ExampleGen component. For example:

TFX - What is example_gen_pb2 and where is it documented?
Input has a single split 'input_dir/*'. # Output 2 splits: train:eval=3:1. output = proto.Output( split_config=example_gen_pb2.

Machine Learning Pipeline with TFX - Deepnote
The CsvExampleGen component from the tfx will be used to ingest the csv data. The train data will be split into train-eval set...

ML Model in Production: Real-world example of End-to-End ...
example_gen = CsvExampleGen(input_base=examples). Custom input/output split. To customize the train/eval split ratio which ExampleGen will ...

Data Ingestion with TensorFlow eXtended (TFX)
from tfx.components import CsvExampleGen from tfx.utils import dsl_utils examples ... and for train/eval splits we would use output_config .
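
The results listed here point to `output_config` with a split configuration (train:eval=3:1 in the second result) as the way to set the ratio. The sketch below does not call TFX. It is a plain-Python illustration, under assumptions, of how a bucket-based ratio could map records to named splits. `SPLITS`, `assign_split`, and the sample rows are hypothetical names, and the real component may hash and assign records differently.

```python
import hashlib
from collections import Counter

# Hypothetical stand-in for a split configuration: (split name, number of hash buckets).
# Three buckets for train and one for eval approximates a 3:1 ratio.
SPLITS = [("train", 3), ("eval", 1)]


def assign_split(record: bytes, splits=SPLITS) -> str:
    """Map a record to a split name based on which hash bucket it falls into."""
    total_buckets = sum(buckets for _, buckets in splits)
    bucket = int(hashlib.sha256(record).hexdigest(), 16) % total_buckets
    upper = 0
    for name, buckets in splits:
        upper += buckets
        if bucket < upper:
            return name
    raise AssertionError("bucket index is always below total_buckets")


# Synthetic CSV-like rows, used only to observe the resulting counts.
rows = [f"5.1,3.5,1.4,0.2,row-{i}".encode("utf-8") for i in range(150)]
counts = Counter(assign_split(row) for row in rows)
print(dict(counts))
```

With a small dataset such as iris, the observed counts only approximate the configured ratio, since each record is assigned independently by its hash.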
