How to set the size of train-eval split with CsvExampleGen?

I have been working with a very simple pipeline that loads the iris dataset and generates some statistics about it. When I run the pipeline, even without specifying a train-eval split anywhere, the folders eval and train are created under the CsvExampleGen pipeline folder, with tfrecords inside them, and an apparently predefined split is applied (around 100 training examples and 50 evaluation examples). My question is: where can I choose whether to split at all, and where can I set the size of the split?

Pipeline code below, being run in Airflow:

import os
import logging
import datetime
from tfx.orchestration.airflow.airflow_runner import AirflowDAGRunner
from tfx.orchestration.pipeline import PipelineDecorator

from tfx.utils.dsl_utils import csv_input
from tfx.components.example_gen.csv_example_gen.component import CsvExampleGen
from tfx.components.statistics_gen.component import StatisticsGen
from tfx.orchestration.tfx_runner import TfxRunner


_CASE_FOLDER = os.path.join(os.environ['HOME'], 'cases', 'iris')
_DATA_FOLDER = os.path.join(_CASE_FOLDER, 'data')
_PIPELINE_ROOT_FOLDER = os.path.join(_CASE_FOLDER, 'pipelines')
_METADATA_DB_ROOT_FOLDER = os.path.join(_CASE_FOLDER, 'metadata')
_LOG_ROOT_FOLDER = os.path.join(_CASE_FOLDER, 'logs')


@PipelineDecorator(
    pipeline_name='test_tfx_pipeline_iris',
    pipeline_root=_PIPELINE_ROOT_FOLDER,
    metadata_db_root=_METADATA_DB_ROOT_FOLDER,
    additional_pipeline_args={'logger_args': {
        'log_root': _LOG_ROOT_FOLDER,
        'log_level': logging.INFO
    }}
)
def create_pipeline():

    print("HELLO")
    examples = csv_input(_DATA_FOLDER)

    example_gen = CsvExampleGen(input_base=examples, name='iris_example_gen_1')
    # Ingests the CSV input and emits tf.Example records.

    statistics_gen = StatisticsGen(input_data=example_gen.outputs.examples)

    return [
        example_gen, statistics_gen
    ]

_airflow_config = {
    'schedule_interval': None,
    'start_date': datetime.datetime(2019, 1, 1),
}
pipeline = AirflowDAGRunner(_airflow_config).run(create_pipeline())

The folder structure being generated:

Top GitHub Comments

krazyhaas commented, Mar 22, 2019

If you’re comfortable modifying your version of TFX, you can change the example_gen executor directly. This will cause all of your pipelines to use the same ratio, so buyer beware! It’s not a great experience, and we’re working to elevate this parameter into the pipeline. But to unblock you, in case you really really want to change the ratio to 9:1 train:eval, make the following change.

If you are working with a GitHub clone, edit tfx/components/example_gen/base_example_gen_executor.py:37:

    return 1 if int(hashlib.sha256(record).hexdigest(), 16) % 10 == 0 else 0

If you are using 0.12.0 downloaded from PyPI, edit tfx/components/example_gen/csv_example_gen/executor.py:39:

    return 1 if int(hashlib.sha256(record).hexdigest(), 16) % 10 == 0 else 0

You’ll have to make the change every time the file gets overwritten (e.g. upgrading to 0.13.0) so waiting for the pipeline config parameter is definitely recommended.
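
To see why that one-line change yields roughly 9:1, here is a minimal standalone sketch of the same hash-and-modulo idea, run outside TFX. The function name partition, the eval_one_in parameter, and the synthetic records are assumptions made for illustration; only the hashlib expression comes from the comment above. The roughly 100/50 split reported in the question would be consistent with a modulus of 3, but the thread does not show the original default, so treat that as a guess.

```python
import hashlib


def partition(record: bytes, eval_one_in: int = 10) -> int:
    """Return 1 (eval) for about one record in `eval_one_in`, else 0 (train).

    Hypothetical helper: it wraps the expression from the comment above so
    the modulus can be varied. It is not part of the TFX API.
    """
    digest = int(hashlib.sha256(record).hexdigest(), 16)
    return 1 if digest % eval_one_in == 0 else 0


# Synthetic stand-ins for serialized records (illustrative only).
records = [f"row-{i}".encode("utf-8") for i in range(10_000)]

eval_count = sum(partition(r) for r in records)
train_count = len(records) - eval_count
print(f"train={train_count} eval={eval_count}")
```

Because the assignment depends only on the record's hash, the same record always lands in the same split across runs, and the ratio is approximate rather than exact, which is why a 150-row dataset gives "around" 100 and 50 rather than a fixed count.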

1025KB commented, Mar 22, 2019

To be more specific: in the long term, we will support pre-split input, a custom ratio, and probably also a custom split function.