
Allow for data exchange between custom KFP components


Describe the issue

Currently the VPE does not expose KFP component outputs, making it impossible for users to choose them as inputs for subsequent nodes. In this example, Get Lines produces a file that Count Lines needs to access.

[Image: pipeline connecting the Get Lines node to the Count Lines node]

A working DSL implementation of this pipeline can be found here: https://github.com/ptitzler/kfp-component-tests/blob/main/pipelines/create_and_run_pipeline-1.py

To Reproduce

Steps to reproduce the behavior:

  1. Register https://raw.githubusercontent.com/ptitzler/kfp-component-tests/main/example-1/component.yaml as a KFP component (Get Lines)
  2. Register https://raw.githubusercontent.com/ptitzler/kfp-component-tests/main/example-2/component.yaml as a KFP component (Count Lines)
  3. Create a pipeline from the two components as shown above.

Expected behavior

DESIGN

==========================================================================================

Motivation

Currently, Elyra supports a handful of sample components from Apache Airflow and Kubeflow Pipelines. These components demonstrate Elyra’s ability to use native concepts from each orchestrator; however, a key portion of their functionality is missing, notably the ability to pass data and/or parameters from one component/operator to another via inputs and outputs.

Considerations

We want to limit the scope of this issue to just the exchange of data between runtime-native components. That is, for the time being, support data exchange between Airflow operators -> Airflow operators and KFP components -> KFP components.

We support both Apache Airflow and Kubeflow Pipelines, but the two runtimes have very different ways of defining inputs and outputs.

Apache Airflow

Apache Airflow uses the concept of XComs, or cross-communication. XComs are small amounts of data shared between tasks (nodes). The data is represented as a key-value pair, with the key being a string and the value anything that is JSON-serializable or picklable (via pickle). XComs can be pushed and pulled between tasks and by default are scoped to the DAG run (pipeline run).

XComs are built into the Airflow BaseOperator, so all operators inherit them; they are accessed via the task_instance (ti) object and the xcom_push and xcom_pull helper methods:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG("xcom_example", start_date=datetime(2021, 1, 1), schedule_interval=None)

# t1 pushes two XComs by rendering xcom_push calls in the templated bash command
t1 = BashOperator(
    task_id="t1",
    bash_command='echo "{{ ti.xcom_push(key="k1", value="v1") }}" "{{ ti.xcom_push(key="k2", value="v2") }}"',
    dag=dag,
)
# t2 pulls the same keys back out of XCom and echoes them
t2 = BashOperator(
    task_id="t2",
    bash_command='echo "{{ ti.xcom_pull(key="k1") }}" "{{ ti.xcom_pull(key="k2") }}"',
    dag=dag,
)
t1 >> t2

Limitations:

Note that there are size limitations on the amount of data that can be passed via XComs. Best practices seem to suggest that objects up to a few MBs are OK to pass via XComs, but anything larger should be handled by file path reference (volumes, S3), as sketched below.
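A rough sketch of the by-reference pattern (assuming Airflow 1.10.x; the S3 location and task names here are hypothetical): only a small path string travels through XCom while the actual data lives in object storage.

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def produce(**context):
    # Hypothetical object storage location; the (potentially large) data is written there
    path = "s3://my-bucket/intermediate/lines.txt"
    # ... upload the data to `path` here ...
    context["ti"].xcom_push(key="lines_path", value=path)  # push only the small reference


def consume(**context):
    path = context["ti"].xcom_pull(key="lines_path")
    # ... download and process the file referenced by `path` here ...


with DAG("xcom_by_reference", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    produce_task = PythonOperator(task_id="produce", python_callable=produce, provide_context=True)
    consume_task = PythonOperator(task_id="consume", python_callable=consume, provide_context=True)
    produce_task >> consume_task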

Resources: a good guide: https://marclamberti.com/blog/airflow-xcom/

Kubeflow Pipelines

Elyra uses KFP component definitions when considering how to handle inputs and outputs and how to share data. Inputs and outputs are specified in the component definition under their respective stanzas and are then used in the implementation with type hints (inputPath, inputValue, outputPath) that describe how each argument should be processed: either by reference (*Path) or by value (*Value).

name: Truncate file
description: Gets the specified number of lines from the input file.

inputs:
- {name: Input 1, type: String, optional: false, description: 'Data for input 1'}
- {name: Parameter 1, type: Integer, default: '100', optional: true, description: 'Number of lines to keep'}

outputs:
- {name: Output 1, type: String, description: 'Output 1 data.'}

implementation:
  container:
    image: quay.io/ptitzler/kfp-ex-truncate-file@sha256:37e20c5f5daae264a05f7bb595aac19ebd7b045667b7056ba3a13fda1b86746e
    # command is a list of strings (command-line arguments). 
    # The YAML language has two syntaxes for lists and you can use either of them. 
    # Here we use the "flow syntax" - comma-separated strings inside square brackets.
    command: [
      python3, 
      # Path of the program inside the container
      /pipelines/component/src/truncate-file.py,
      --input1-path,
      {inputPath: Input 1},
      --param1, 
      {inputValue: Parameter 1},
      --output1-path, 
      {outputPath: Output 1},
    ]
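For context, the program inside the container might look like the following minimal sketch (the actual truncate-file.py ships in the referenced image; this reconstruction assumes only the argument names from the command list above). At runtime KFP materializes {inputPath: Input 1} as the path of a readable file, substitutes the literal value for {inputValue: Parameter 1}, and supplies a destination path for {outputPath: Output 1} that the program is expected to write:

# truncate-file.py -- illustrative sketch, not the actual source
import argparse
from pathlib import Path

parser = argparse.ArgumentParser()
parser.add_argument("--input1-path", required=True)     # from {inputPath: Input 1}
parser.add_argument("--param1", type=int, default=100)  # from {inputValue: Parameter 1}
parser.add_argument("--output1-path", required=True)    # from {outputPath: Output 1}
args = parser.parse_args()

# Keep only the first `param1` lines of the input file
lines = Path(args.input1_path).read_text().splitlines()[: args.param1]

out = Path(args.output1_path)
out.parent.mkdir(parents=True, exist_ok=True)  # KFP does not pre-create the output directory
out.write_text("\n".join(lines) + "\n")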

Using the truncate example above, the following inputs and outputs would appear as properties in the node properties pane of the pipeline editor:

inputs:
- {name: Input 1, type: String, optional: false, description: 'Data for input 1'}
- {name: Parameter 1, type: Integer, default: '100', optional: true, description: 'Number of lines to keep'}

outputs:
- {name: Output 1, type: String, description: 'Output 1 data.'}
  • All inputs and outputs need to be pre-defined in the component definition. There are 3 types of input/output fields (inputPath, inputValue, outputPath). These fields are hints that tell KFP how to use each argument during compilation:

    command: [
      python3,
      # Path of the program inside the container
      /pipelines/component/src/truncate-file.py,
      --input1-path,
      {inputPath: Input 1},
      --param1,
      {inputValue: Parameter 1},
      --output1-path,
      {outputPath: Output 1},
    ]

Limitations: Best practices indicate that users should limit the amount of data passed by value to 200KB per pipeline run.

Envisioned workflow

Given that the inputs and outputs are all defined prior to submission, we should be able to translate them into the appropriate fields for each runtime in order to properly execute the pipeline.

  • A user would be able to define both the input(s) and output(s) in the node properties pane
  • Outputs defined in upstream nodes would be made available to downstream nodes in the UI

Example scenarios (Cn = <component n>, On = <output n>, In = <input n>)

  • C1() -> C2(I1) => for C2 the user can

    • do nothing
    • enter the value of I1
  • C1(O1,O2) -> C2(I1) => for C2 the user can

    • do nothing
    • enter the value of I1
    • select O1 or O2 as the value of I1
  • C1(O1,O2) -> C2(I1,O3) -> C3(I2) => for C3 the user can

    • do nothing
    • enter the value of I2
    • select O1 or O2 as the value of I2
    • select O3 as the value of I2

Example: [image]
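In KFP DSL terms, "select O1 as the value of I1" amounts to passing an upstream task's output reference as the downstream component's argument. A minimal sketch using the two components from the reproduction steps (the component signatures are assumed, and the output key is the sanitized pythonic name the KFP v1 SDK derives from the declared output name):

from kfp import dsl
from kfp.components import load_component_from_url

# The component definitions registered in the reproduction steps above
get_lines_op = load_component_from_url(
    "https://raw.githubusercontent.com/ptitzler/kfp-component-tests/main/example-1/component.yaml")
count_lines_op = load_component_from_url(
    "https://raw.githubusercontent.com/ptitzler/kfp-component-tests/main/example-2/component.yaml")


@dsl.pipeline(name="get-and-count-lines")
def pipeline():
    get_lines_task = get_lines_op()  # C1: produces the output file
    # C2: the upstream output reference is wired into this component's input path
    count_lines_task = count_lines_op(get_lines_task.outputs["output_1"])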

==========================================================================================

KFP Implementation

Expectations after completion:

  • Users will be able to construct a pipeline consisting of both generic and KFP components
  • Users will only be able to configure component parameters that are typed as inputValue from the VPE
  • Users will be able to exchange inputs and outputs between KFP components -> KFP components
  • Users will continue to be able to exchange inputs and outputs between generic components -> generic components
  • Users will not be able to add outputs from any nodes not explicitly linked in the visual pipeline editor
  • Users will not be able to exchange inputs and outputs between generic components -> KFP components and vice versa
  • Users will not be able to explicitly pass in data as values to a component parameter designated as an inputPath
  • Users will not be able to include any local dependencies/files with KFP components

Open Discussions:

  • How to handle naming conflicts between inputs and outputs, e.g. the papermill notebook name is the same for both an input and an output.
    • Prefix (output_*) vs. adding ['inputs'] and ['outputs'] sub-stanzas to ['app_data']['component_parameters'] in the payload to the processor?
  • How will the front end determine the available list of outputs for downstream nodes, and how will it include the selected output in the payload response at submission?

AA (Apache Airflow) Implementation

Expectations after completion:

  • Users will be able to construct a pipeline consisting of generic components and Airflow components

  • Users will be able to exchange information between Airflow component -> Airflow component via XComs

  • Airflow components will be expected to be subclassed from the Airflow BaseOperator

  • Airflow component XCom pushes will be limited to the single default key, return_value

  • Users will continue to be able to exchange inputs and outputs between generic components -> generic components

  • Users will not be able to exchange inputs and outputs between generic components -> Airflow components and vice versa

  • Users will not be able to include any local dependencies/files with Airflow components

  • All Airflow operator parameter inputs will be able to take an XCom as their value. Airflow does not make the distinction through parameter typing (unlike KFP with inputPath and outputPath)

    • This complicates the UI a bit: every node property in the UI will need to be able to take in an output from a parent node
    • Typing becomes more complicated: the UI receives information about what types of properties the component contains and renders accordingly, but this change will require that we define all properties as a “placeholder” type for XComs as well as their original type.
  • The data exchange format will differ little from the KFP format, only really requiring the node_id, e.g.:

               "type":"execution_node",
               "op":"bash-operator_BashOperator",
               "app_data":{
                  "label":"BashOperator",
                  "component_parameters":{
                     "runtime_image":"alpine:latest",
                     "component_source":"https://raw.githubusercontent.com/apache/airflow/1.10.15/airflow/operators/bash_operator.py",
                     "component_source_type":"url",
                     "bash_command": {
                              "node_id": "01ea88e3-3f21-4dd0-a526-b6bd09792b01",
                              "output_key": ""
                     "xcom_push":true,
                     "env":"{\"TEST_ENV\": \"Hello World\"}",
                     "output_encoding":"utf-8",
  • AA - Node property types can double as both their original primitive/non-primitive types and as an XCom reference to a parent output. Can we have a custom controller that offers both options, but only in an either/or scenario? e.g. [image]
  • AA - Type checking needs to be updated for all properties, since they can double as XCom pulls as well as their original types

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 20 (20 by maintainers)

Top GitHub Comments

1 reaction
akchinSTC commented, Aug 31, 2021

Current response when querying the component properties registry API:

{
    "current_parameters": {
        "label": "",
        "component_source": "kfp/run_notebook_using_papermill.yaml",
        "elyra_notebook": "",
        "elyra_parameters": "{}",
        "elyra_packages_to_install": "[]",
        "elyra_input_data": ""
    },
    "parameters": [
        {
            "id": "label"
        },
        {
            "id": "component_source"
        },
        {
            "id": "elyra_notebook"
        },
        {
            "id": "elyra_parameters"
        },
        {
            "id": "elyra_packages_to_install"
        },
        {
            "id": "elyra_input_data"
        }
    ],
    "uihints": {
        "id": "nodeProperties",
        "parameter_info": [
            {
                "parameter_ref": "label",
                "control": "custom",
                "custom_control_id": "StringControl",
                "label": {
                    "default": "Label"
                },
                "description": {
                    "default": "A custom label for the node.",
                    "placement": "on_panel"
                },
                "data": {}
            },
            {
                "parameter_ref": "component_source",
                "control": "readonly",
                "label": {
                    "default": "Component Source"
                },
                "description": {
                    "default": "The path to the component specification file.",
                    "placement": "on_panel"
                },
                "data": {}
            },
            {
                "parameter_ref": "elyra_notebook",
                "control": "custom",
                "custom_control_id": "StringControl",
                "label": {
                    "default": "Notebook"
                },
                "description": {
                    "default": "Required. Notebook to execute. (type: JupyterNotebook)",
                    "placement": "on_panel"
                },
                "data": {
                    "format": "file",
                    "required": true
                }
  • Modification of the information between the front end and the component registry to accommodate the inputs and outputs per component:
    • ui_hints -> parameter_info -> data -> format - for inputs, will need to change to something that denotes a drop-down selection; this will be the source of truth for determining whether a field is an inputPath
    • ui_hints -> parameter_info -> control - for outputs, will need to change to readonly so users can still see what outputs a component provides
    • parameters - this section comes as a list of ids. The listing of parameters is currently prepended with elyra_ to denote that it’s a component parameter, but is unable to show whether it is an input or an output. Need a little clarification from @marthacryan as to how the front end consumes these id values and whether or not we can add additional information like { "id": "elyra_notebook", "type": "input" }, or if we need to do some additional prepending to parse out later, e.g. elyra_input_notebook
1 reaction
ptitzler commented, Aug 30, 2021

[We may need to just document this behavior somewhere] so when the user links two nodes together via the properties inputs and outputs, but doesn’t link those two nodes together via the “drag and connect”,

I don’t think this should be allowed because it would unnecessarily complicate things for the user and the VPE implementation:

  • implicit dependencies would have to be explicitly visualized in the graph, which would likely produce a graph rendering that is hard to consume
  • a node’s input selection options would have to include the outputs of every pipeline node and increase the likelihood that circular dependencies are created