Allow for data exchange between custom KFP components
Describe the issue
Currently the VPE does not expose KFP component outputs, making it impossible for users to choose them as inputs for subsequent nodes. In this example, Get Lines produces a file that Count Lines needs to access.
A working DSL implementation of this pipeline can be found here: https://github.com/ptitzler/kfp-component-tests/blob/main/pipelines/create_and_run_pipeline-1.py
To Reproduce
Steps to reproduce the behavior:
- Register https://raw.githubusercontent.com/ptitzler/kfp-component-tests/main/example-1/component.yaml as a KFP component (Get Lines)
- Register https://raw.githubusercontent.com/ptitzler/kfp-component-tests/main/example-2/component.yaml as a KFP component (Count Lines)
- Create a pipeline from the two components as shown above.
Expected behavior
- A user should be able to select any compatible output from a parent node as an input. (see example scenarios listed in https://github.com/elyra-ai/elyra/issues/1761#issuecomment-872596503)
DESIGN
==========================================================================================
Motivation
Currently, Elyra supports a handful of sample components from Apache Airflow and Kubeflow Pipelines. These components demonstrate Elyra's ability to use native concepts from each orchestrator; however, a key portion of their functionality is missing, notably the ability to pass data and/or parameters from one component/operator to another via inputs and outputs.
Considerations
We want to limit the scope of this issue to the exchange of data between runtime-native components. That is, for the time being, support data exchange between Airflow operators -> Airflow operators and KFP components -> KFP components.
We support both Apache Airflow and Kubeflow Pipelines, but the two runtimes define inputs and outputs in very different ways.
Apache Airflow
Apache Airflow uses the concept of XComs, or cross-communication. XComs are small amounts of data shared between tasks (nodes). The data is represented as a key-value pair, with a string key and a value that is JSON-serializable or picklable (via pickle). XComs can be pushed and pulled between tasks and by default are scoped to the DAG run (pipeline run).
XComs are built into the Airflow BaseOperator, so all operators inherit them; they are accessed via the task_instance (ti) object and the xcom_push and xcom_pull helper methods.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # Airflow 1.10.x import path

dag = DAG("xcom_example", start_date=datetime(2021, 1, 1), schedule_interval=None)

# t1 pushes two key-value pairs into XCom via the task_instance (ti) template variable
t1 = BashOperator(
    task_id="t1",
    bash_command='echo "{{ ti.xcom_push(key="k1", value="v1") }}" "{{ ti.xcom_push(key="k2", value="v2") }}"',
    dag=dag,
)

# t2 pulls both values back out of XCom
t2 = BashOperator(
    task_id="t2",
    bash_command='echo "{{ ti.xcom_pull(key="k1") }}" "{{ ti.xcom_pull(key="k2") }}"',
    dag=dag,
)

t1 >> t2
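Since the design below limits Airflow pushes to the default return_value key, here is a minimal sketch of that pattern (task ids are illustrative, reusing the dag above):

from airflow.operators.python_operator import PythonOperator  # Airflow 1.10.x import path

def produce():
    # A callable's return value is automatically pushed to XCom
    # under the default key "return_value"
    return "v1"

p1 = PythonOperator(task_id="p1", python_callable=produce, dag=dag)

p2 = BashOperator(
    task_id="p2",
    # xcom_pull with no explicit key defaults to "return_value"
    bash_command='echo "{{ ti.xcom_pull(task_ids=\'p1\') }}"',
    dag=dag,
)

p1 >> p2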
Limitations:
Note that there are size limits on the amount of data that can be passed via XComs. Best practices suggest that objects up to a few MB are fine to pass via XComs, but anything larger should be handled by file-path reference (volumes, S3).
Resources: a good guide: https://marclamberti.com/blog/airflow-xcom/
Kubeflow Pipelines
Elyra uses KFP component definitions when considering how it handles inputs and outputs and how to share data. Inputs and outputs are specified in the component definition under their respective sections and are then used in the implementation with type hints (inputPath, inputValue, outputPath) that describe how each argument should be processed: either by reference (*Path) or by value (*Value).
name: Truncate file
description: Gets the specified number of lines from the input file.
inputs:
- {name: Input 1, type: String, optional: false, description: 'Data for input 1'}
- {name: Parameter 1, type: Integer, default: '100', optional: true, description: 'Number of lines to keep'}
outputs:
- {name: Output 1, type: String, description: 'Output 1 data.'}
implementation:
container:
image: quay.io/ptitzler/kfp-ex-truncate-file@sha256:37e20c5f5daae264a05f7bb595aac19ebd7b045667b7056ba3a13fda1b86746e
# command is a list of strings (command-line arguments).
# The YAML language has two syntaxes for lists and you can use either of them.
# Here we use the "flow syntax" - comma-separated strings inside square brackets.
command: [
python3,
# Path of the program inside the container
/pipelines/component/src/truncate-file.py,
--input1-path,
{inputPath: Input 1},
--param1,
{inputValue: Parameter 1},
--output1-path,
{outputPath: Output 1},
]
Using the Truncate file example above, the following inputs and outputs would appear as properties in the node properties pane in the pipeline editor:
inputs:
- {name: Input 1, type: String, optional: false, description: 'Data for input 1'}
- {name: Parameter 1, type: Integer, default: '100', optional: true, description: 'Number of lines to keep'}
outputs:
- {name: Output 1, type: String, description: 'Output 1 data.'}
- All inputs and outputs need to be pre-defined in the component definition. There are 3 types of input/output fields; these fields are hints that tell KFP how to handle each argument during compilation:
python3,
# Path of the program inside the container
/pipelines/component/src/truncate-file.py,
--input1-path,
{inputPath: Input 1},
--param1,
{inputValue: Parameter 1},
--output1-path,
{outputPath: Output 1},
inputValue
- Used for direct intake of parameters or small bits of data; these parameters are passed directly as values to the application in the container.
inputPath
- Runtime inputs: these are read from upstream sources created during pipeline execution, e.g. nodeA.outputs['output_1']. See https://github.com/akchinSTC/elyra/blob/e0f5ef0234d0ad79a4cb848eb108ac00dbe0af64/etc/test_download_and_count.py#L25-L27
outputPath
- The location where the application within the image writes its data.
- Output paths cannot be set explicitly; they are handled exclusively by KFP (see https://stackoverflow.com/questions/67241248/how-to-use-outputpath-across-multiple-components-in-kubeflow). This means that any outputs from components should be defined at the definition level and cannot be modified at runtime. The registry should still make these outputs available to the front end when constructing the pipeline.
- Each of these input/output fields is also given a type:
- Int, String, Float, Bool, List, Dict, LocalPath
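For illustration, a minimal sketch of how these hints play out in the KFP v1 SDK, using the Truncate file definition above (the snake_case argument names are the forms KFP derives from the display names; the file name and pipeline wiring are assumptions, not part of the design):

import kfp
from kfp import components

# Load the "Truncate file" component definition shown above
truncate_op = components.load_component_from_file("component.yaml")

@kfp.dsl.pipeline(name="truncate-example")
def pipeline(text: str = "line 1\nline 2\nline 3"):
    # "Parameter 1" is an inputValue: the literal is passed by value.
    # "Input 1" is an inputPath: a constant passed here is materialized
    # into a file by KFP before the container runs.
    first = truncate_op(input_1=text, parameter_1=2)
    # A downstream task consumes the upstream outputPath by reference:
    second = truncate_op(input_1=first.outputs["output_1"], parameter_1=1)

kfp.compiler.Compiler().compile(pipeline, "truncate-example.yaml")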
Limitations: Best practices indicate that users should limit the amount of data passed by value to 200KB per pipeline run.
Envisioned workflow
Given that the inputs and outputs are all defined prior to submission, we should be able to translate them into the appropriate fields for each runtime in order to properly execute the pipeline.
- A user would be able to define both the input(s) and output(s) in the node properties pane
- Outputs defined in upstream nodes would be made available to downstream nodes in the UI
Example scenarios (Cn = <component N>, On = <output N>, In = <input N>)

C1() -> C2(I1) => For C2, the user can:
- do nothing
- enter the value of I1

C1(O1,O2) -> C2(I1) => For C2, the user can:
- do nothing
- enter the value of I1
- select O1 or O2 as the value of I1

C1(O1,O2) -> C2(I1,O3) -> C3(I2) => For C3, the user can:
- do nothing
- enter the value of I2
- select O1 or O2 as the value of I2
- select O3 as the value of I2
==========================================================================================
KFP Implementation
Expectations after completion:
- Users will be able to construct a pipeline consisting of both generic and kfp components
- Users will only be able to configure component parameters that are typed as inputValue from the VPE
- Users will be able to exchange input and output between kfp components -> kfp components
- Users will continue to be able to exchange input and output between generic components -> generic components
- Users will not be able to add outputs from any nodes not explicitly linked in the visual pipeline editor
- Users will not be able to exchange inputs and outputs between generic components -> kfp components and vice versa
- Users will not be able to explicitly pass in data as values into a component parameter designated as an inputPath
- Users will not be able to include any local dependencies/files with kfp components
Open Discussions:
- How to handle naming conflicts between inputs and outputs, e.g. the papermill notebook name is the same in both input and output.
- Prefix (output_*) vs. adding sub-stanzas ['inputs'] and ['outputs'] to ['app_data']['component_parameters'] in the payload to the processor? (See the sketch after this list.)
- How will the front end determine the available list of outputs for downstream nodes, and how will the front end include the selected output in the payload at submission?
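A hypothetical sketch of the sub-stanza option, reusing the node_id/output_key shape from the Airflow payload below (field names are illustrative, not final):

"app_data": {
  "component_parameters": {
    "inputs": {
      "notebook": { "node_id": "<parent node id>", "output_key": "output_notebook" }
    },
    "outputs": [ "output_notebook" ]
  }
}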
AA Implementation
Expectations after completion:
- Users will be able to construct a pipeline consisting of generic components and airflow components
- Users will be able to exchange information between airflow component -> airflow component via xcoms
- Airflow components will be expected to be subclassed from the airflow BaseOperator
- Airflow component xcom pushes will be limited to the single default value return_value
- Users will continue to be able to exchange input and output between generic components -> generic components
- Users will not be able to exchange inputs and outputs between generic components -> airflow components and vice versa
- Users will not be able to include any local dependencies/files with airflow components
- All Airflow operator parameter inputs will be able to take an xcom as their value. Airflow does not make this distinction through parameter typing (unlike kfp with inputPath and outputPath).
  - This complicates the UI a bit; every node property in the UI will need to be able to take in an output from a parent node.
  - Typing becomes more complicated; the UI receives information about what types of properties the component contains and renders accordingly, but this change will require that we define all properties with a "placeholder" type for xcom as well as their original type.
- The data exchange format will differ little from the kfp format, only really requiring the node_id:
"type":"execution_node",
"op":"bash-operator_BashOperator",
"app_data":{
"label":"BashOperator",
"component_parameters":{
"runtime_image":"alpine:latest",
"component_source":"https://raw.githubusercontent.com/apache/airflow/1.10.15/airflow/operators/bash_operator.py",
"component_source_type":"url",
"bash_command": {
"node_id": "01ea88e3-3f21-4dd0-a526-b6bd09792b01",
"output_key": ""
"xcom_push":true,
"env":"{\"TEST_ENV\": \"Hello World\"}",
"output_encoding":"utf-8",
- AA - Node property types can double as both their original primitive/non-primitive types and an xcom reference to parent outputs. Can we have a custom controller that can take in both of these values, but only in an either/or scenario?
- AA - Type checking needs to be updated for all properties, since they can double as xcom pulls as well as their original types.
Top GitHub Comments
Current response when querying the component properties registry API:
- inputPath: render as readonly, so users can still see what outputs a component provides.
- ids: this listing of parameters is currently prepended with elyra_ to denote that it is a component parameter, but it is unable to show whether a parameter is an input or an output. Need a little clarification from @marthacryan as to how the front end consumes these id values and whether or not we can add additional information like { "id": "elyra_notebook", "type": "input" }, or if we need to do some additional prepending to parse out later, e.g. elyra_input_notebook.
I don’t think this should be allowed because it would unnecessarily complicate things for the user and the VPE implementation: