Allow for data exchange between custom KFP components
Describe the issue
Currently the VPE does not expose KFP component outputs, making it impossible for users to choose them as inputs for subsequent nodes. In this example, Get Lines produces a file that Count Lines needs to access.
A working DSL implementation of this pipeline can be found here: https://github.com/ptitzler/kfp-component-tests/blob/main/pipelines/create_and_run_pipeline-1.py
To Reproduce
Steps to reproduce the behavior:
- Register https://raw.githubusercontent.com/ptitzler/kfp-component-tests/main/example-1/component.yaml as a KFP component (Get Lines)
- Register https://raw.githubusercontent.com/ptitzler/kfp-component-tests/main/example-2/component.yaml as a KFP component (Count Lines)
- Create a pipeline from the two components as shown above.
Expected behavior
- A user should be able to select any compatible output from a parent node as an input. (see example scenarios listed in https://github.com/elyra-ai/elyra/issues/1761#issuecomment-872596503)
DESIGN
==========================================================================================
Motivation
Currently, Elyra supports a handful of sample components from Apache Airflow and Kubeflow Pipelines. These components demonstrate Elyra's ability to use native concepts from each orchestrator; however, a key portion of their functionality is missing, notably the ability to pass data and/or parameters from one component/operator to another via inputs and outputs.
Considerations
We want to limit the scope of this issue to the exchange of data between runtime-native components. That is, for the time being, support data exchange between Airflow operators -> Airflow operators and KFP components -> KFP components.
We support both Apache Airflow and Kubeflow Pipelines, but the two runtimes define inputs and outputs in very different ways.
Apache Airflow
Apache Airflow uses the concept of XComs, or cross-communication. XComs are small amounts of data shared between tasks (nodes). The data is represented as a key-value pair, with a string key and a value that is JSON-serializable or picklable (via pickle). XComs can be pushed and pulled between tasks and by default are scoped to the DAG run (pipeline run).
XComs are built into the Airflow BaseOperator, so all operators inherit them; they are accessed via the task_instance (ti) object and the xcom_push and xcom_pull helper methods.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # Airflow 1.10.x import path

dag = DAG("xcom_example", start_date=datetime(2021, 1, 1), schedule_interval=None)

# t1 pushes two key-value pairs into XCom via the task_instance (ti) template variable
t1 = BashOperator(
    task_id="t1",
    bash_command='echo "{{ ti.xcom_push(key="k1", value="v1") }}" "{{ ti.xcom_push(key="k2", value="v2") }}"',
    dag=dag,
)

# t2 pulls both values back out of XCom
t2 = BashOperator(
    task_id="t2",
    bash_command='echo "{{ ti.xcom_pull(key="k1") }}" "{{ ti.xcom_pull(key="k2") }}"',
    dag=dag,
)

t1 >> t2
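Since the design below limits Airflow pushes to the default return_value key, here is a minimal sketch of that pattern (task ids are illustrative, reusing the dag above):

from airflow.operators.python_operator import PythonOperator  # Airflow 1.10.x import path

def produce():
    # A callable's return value is automatically pushed to XCom
    # under the default key "return_value"
    return "v1"

p1 = PythonOperator(task_id="p1", python_callable=produce, dag=dag)

p2 = BashOperator(
    task_id="p2",
    # xcom_pull with no explicit key defaults to "return_value"
    bash_command='echo "{{ ti.xcom_pull(task_ids=\'p1\') }}"',
    dag=dag,
)

p1 >> p2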
Limitations:
Note that there are size limits on the amount of data that can be passed via XComs. Best practices suggest that objects up to a few MB are fine to pass via XComs, but anything larger should be handled by file-path reference (volumes, S3).
Resources: a good guide: https://marclamberti.com/blog/airflow-xcom/
Kubeflow Pipelines
Elyra uses KFP component definitions when considering how it handles inputs and outputs and how to share data. Inputs and outputs are specified in the component definition under their respective sections and are then used in the implementation with type hints (inputPath, inputValue, outputPath) that describe how each argument should be processed: either by reference (*Path) or by value (*Value).
name: Truncate file
description: Gets the specified number of lines from the input file.
inputs:
- {name: Input 1, type: String, optional: false, description: 'Data for input 1'}
- {name: Parameter 1, type: Integer, default: '100', optional: true, description: 'Number of lines to keep'}
outputs:
- {name: Output 1, type: String, description: 'Output 1 data.'}
implementation:
container:
image: quay.io/ptitzler/kfp-ex-truncate-file@sha256:37e20c5f5daae264a05f7bb595aac19ebd7b045667b7056ba3a13fda1b86746e
# command is a list of strings (command-line arguments).
# The YAML language has two syntaxes for lists and you can use either of them.
# Here we use the "flow syntax" - comma-separated strings inside square brackets.
command: [
python3,
# Path of the program inside the container
/pipelines/component/src/truncate-file.py,
--input1-path,
{inputPath: Input 1},
--param1,
{inputValue: Parameter 1},
--output1-path,
{outputPath: Output 1},
]
Using the Truncate file example above, the following inputs and outputs would appear as properties in the node properties pane in the pipeline editor:
inputs:
- {name: Input 1, type: String, optional: false, description: 'Data for input 1'}
- {name: Parameter 1, type: Integer, default: '100', optional: true, description: 'Number of lines to keep'}
outputs:
- {name: Output 1, type: String, description: 'Output 1 data.'}
- All inputs and outputs need to be pre-defined in the component definition. There are 3 types of input/output fields; these fields are hints that tell KFP how to handle each argument during compilation:
python3,
# Path of the program inside the container
/pipelines/component/src/truncate-file.py,
--input1-path,
{inputPath: Input 1},
--param1,
{inputValue: Parameter 1},
--output1-path,
{outputPath: Output 1},
inputValue
- Used for direct intake of parameters or small bits of data; these parameters are passed directly as values to the application in the container.
inputPath
- Runtime inputs: these are read from upstream sources created during pipeline execution, e.g. nodeA.outputs['output_1']. See https://github.com/akchinSTC/elyra/blob/e0f5ef0234d0ad79a4cb848eb108ac00dbe0af64/etc/test_download_and_count.py#L25-L27
outputPath
- The location where the application within the image writes its data.
- Output paths cannot be set explicitly; they are handled exclusively by KFP (see https://stackoverflow.com/questions/67241248/how-to-use-outputpath-across-multiple-components-in-kubeflow). This means that any outputs from components should be defined at the definition level and cannot be modified at runtime. The registry should still make these outputs available to the front end when constructing the pipeline.
- Each of these input/output fields is also given a type:
- Int, String, Float, Bool, List, Dict, LocalPath
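For illustration, a minimal sketch of how these hints play out in the KFP v1 SDK, using the Truncate file definition above (the snake_case argument names are the forms KFP derives from the display names; the file name and pipeline wiring are assumptions, not part of the design):

import kfp
from kfp import components

# Load the "Truncate file" component definition shown above
truncate_op = components.load_component_from_file("component.yaml")

@kfp.dsl.pipeline(name="truncate-example")
def pipeline(text: str = "line 1\nline 2\nline 3"):
    # "Parameter 1" is an inputValue: the literal is passed by value.
    # "Input 1" is an inputPath: a constant passed here is materialized
    # into a file by KFP before the container runs.
    first = truncate_op(input_1=text, parameter_1=2)
    # A downstream task consumes the upstream outputPath by reference:
    second = truncate_op(input_1=first.outputs["output_1"], parameter_1=1)

kfp.compiler.Compiler().compile(pipeline, "truncate-example.yaml")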
Limitations: Best practices indicate that users should limit the amount of data passed by value to 200KB per pipeline run.
Envisioned workflow
Given that the inputs and outputs are all defined prior to submission, we should be able to translate them into the appropriate fields for each runtime in order to properly execute the pipeline.
- A user would be able to define both the input(s) and output(s) in the node properties pane
- Outputs defined in upstream nodes would be made available to downstream nodes in the UI
Example scenarios (Cn = <component N>, On = <output N>, In = <input N>)

C1() -> C2(I1) => For C2, the user can:
- do nothing
- enter the value of I1

C1(O1,O2) -> C2(I1) => For C2, the user can:
- do nothing
- enter the value of I1
- select O1 or O2 as the value of I1

C1(O1,O2) -> C2(I1,O3) -> C3(I2) => For C3, the user can:
- do nothing
- enter the value of I2
- select O1 or O2 as the value of I2
- select O3 as the value of I2
==========================================================================================
KFP Implementation
Expectations after completion:
- Users will be able to construct a pipeline consisting of both generic and kfp components
- Users will only be able to configure component parameters that are typed as inputValue from the VPE
- Users will be able to exchange input and output between kfp components -> kfp components
- Users will continue to be able to exchange input and output between generic components -> generic components
- Users will not be able to add outputs from any nodes not explicitly linked in the visual pipeline editor
- Users will not be able to exchange inputs and outputs between generic components -> kfp components and vice versa
- Users will not be able to explicitly pass in data as values into a component parameter designated as an inputPath
- Users will not be able to include any local dependencies/files with kfp components
Open Discussions:
- How to handle naming conflicts between inputs and outputs, e.g. the papermill notebook name is the same in both input and output.
- Prefix (output_*) vs. adding sub-stanzas ['inputs'] and ['outputs'] to ['app_data']['component_parameters'] in the payload to the processor? (See the sketch after this list.)
- How will the front end determine the available list of outputs for downstream nodes, and how will the front end include the selected output in the payload at submission?
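A hypothetical sketch of the sub-stanza option, reusing the node_id/output_key shape from the Airflow payload below (field names are illustrative, not final):

"app_data": {
  "component_parameters": {
    "inputs": {
      "notebook": { "node_id": "<parent node id>", "output_key": "output_notebook" }
    },
    "outputs": [ "output_notebook" ]
  }
}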
AA Implementation
Expectations after completion:
- Users will be able to construct a pipeline consisting of generic components and airflow components
- Users will be able to exchange information between airflow component -> airflow component via xcoms
- Airflow components will be expected to be subclassed from the airflow BaseOperator
- Airflow component xcom pushes will be limited to the single default value return_value
- Users will continue to be able to exchange input and output between generic components -> generic components
- Users will not be able to exchange inputs and outputs between generic components -> airflow components and vice versa
- Users will not be able to include any local dependencies/files with airflow components
- All Airflow operator parameter inputs will be able to take an xcom as their value. Airflow does not make this distinction through parameter typing (unlike kfp with inputPath and outputPath).
  - This complicates the UI a bit; every node property in the UI will need to be able to take in an output from a parent node.
  - Typing becomes more complicated; the UI receives information about what types of properties the component contains and renders accordingly, but this change will require that we define all properties with a "placeholder" type for xcom as well as their original type.
- The data exchange format will differ little from the kfp format, only really requiring the node_id:
"type":"execution_node",
"op":"bash-operator_BashOperator",
"app_data":{
"label":"BashOperator",
"component_parameters":{
"runtime_image":"alpine:latest",
"component_source":"https://raw.githubusercontent.com/apache/airflow/1.10.15/airflow/operators/bash_operator.py",
"component_source_type":"url",
"bash_command": {
"node_id": "01ea88e3-3f21-4dd0-a526-b6bd09792b01",
"output_key": ""
"xcom_push":true,
"env":"{\"TEST_ENV\": \"Hello World\"}",
"output_encoding":"utf-8",
- AA - Node property types can double as both their original primitive/non-primitive types and an xcom reference to parent outputs. Can we have a custom controller that can take in both of these values, but only in an either/or scenario?
- AA - Type checking needs to be updated for all properties, since they can double as xcom pulls as well as their original types.
Top GitHub Comments
Current response when querying the component properties registry API:
- inputPath: render as readonly, so users can still see what outputs a component provides.
- ids: this listing of parameters is currently prepended with elyra_ to denote that it is a component parameter, but it is unable to show whether a parameter is an input or an output. Need a little clarification from @marthacryan as to how the front end consumes these id values and whether or not we can add additional information like { "id": "elyra_notebook", "type": "input" }, or if we need to do some additional prepending to parse out later, e.g. elyra_input_notebook.
I don’t think this should be allowed because it would unnecessarily complicate things for the user and the VPE implementation: