Filepath handling for generic and custom components
Please let me know of any mistakes, misunderstandings, or anything I have missed.
Before the introduction of custom components, there were 3 properties that could contain file paths:
- filename
- dependencies
- outputs
To discuss the subtleties in file path handling, I will define several scenarios:
- persistence: Anytime the pipeline contents are sitting at rest in a file
- UI time: When the pipeline is loaded into the UI (basically runtime, but I want to avoid more uses of that word)
- submission API: When submitting a pipeline for export or running via the submit/export button in the UI
- CLI: When interacting with a pipeline file via the CLI
When a pipeline file is persisted:
- filename is the relative path to the file from the pipeline
- dependencies and outputs are the relative paths to the file from the above stated filename property
At UI time these properties are unaltered:
- filename is the relative path to the file from the pipeline
- dependencies and outputs are the relative paths to the file from the above stated filename property
When using the submission API, a copy of the pipeline is made and its paths are updated. This copy does not make its way back to the UI or to the persisted file; it is only sent as a request to the server.
- filename is the path relative to the workspace (the directory where JupyterLab was started)
- dependencies and outputs are the relative paths to the file from the above stated filename property
When submitting a pipeline with the CLI, the pipeline path provided is used to load the pipeline JSON. A copy of the pipeline is made and its paths are updated. This copy of the pipeline does not get persisted.
- filename is the path relative to the directory the CLI was called from
- dependencies and outputs are the relative paths to the file from the above stated filename property
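The path rewriting described in the scenarios above can be sketched roughly as follows. This is only an illustration, not the actual frontend or CLI code: the function names, the node dictionary layout, and the example paths are all assumptions made for this sketch. The base directory would be the workspace for the submission API, or the directory the CLI was called from.

```python
import os


def rebase_filename(filename: str, pipeline_path: str, base_dir: str) -> str:
    """Convert a pipeline-relative filename into one relative to base_dir.

    Hypothetical helper: pipeline_path is where the pipeline file lives and
    base_dir is the workspace directory (submission API) or the directory
    the CLI was called from.
    """
    pipeline_dir = os.path.dirname(os.path.abspath(pipeline_path))
    absolute = os.path.normpath(os.path.join(pipeline_dir, filename))
    return os.path.relpath(absolute, start=os.path.abspath(base_dir))


def prepare_for_submission(node: dict, pipeline_path: str, base_dir: str) -> dict:
    """Return a copy of the node with only filename rebased.

    dependencies and outputs are left exactly as persisted, because they
    stay relative to filename.
    """
    updated = dict(node)  # the persisted node itself is not modified
    updated["filename"] = rebase_filename(node["filename"], pipeline_path, base_dir)
    return updated


# Assumed layout: workspace at /ws, pipeline at /ws/project/pipelines/demo.pipeline
node = {
    "filename": "../notebooks/load.ipynb",
    "dependencies": ["data/input.csv"],
    "outputs": ["out.csv"],
}
submitted = prepare_for_submission(node, "/ws/project/pipelines/demo.pipeline", "/ws")
print(submitted["filename"])
```

Only the copy is changed, which matches the point that the updated paths never reach the UI or the persisted file.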
Why use relative to pipeline when persisted?
Storing paths relative to the pipeline allows for the files to be found no matter where JupyterLab is started from as well as keeping pipelines sharable.
Why are the paths updated to be relative to the workspace directory for submission?
When a pipeline is submitted, the backend isn’t aware of the existence of the pipeline file. The backend only receives a JSON payload of the pipeline contents. This allows pipelines that don’t actually exist on the filesystem to be run. An example of this is running a single notebook as a pipeline: a fake pipeline JSON is created and sent to the server. The drawback, however, is that since all paths are relative to the pipeline, the backend can no longer find the files, because it doesn’t know where the pipeline is (or the pipeline file doesn’t even exist, in the case of the single notebook run). For that reason, before submission, all paths need to be updated to be relative to the workspace directory (essentially an absolute path in the eyes of JupyterLab) so that the server can find the appropriate files.
Why aren’t dependencies and outputs updated?
The dependencies need to show up in the container at the path explicitly specified in properties. That is why no modifications can be made at submission time. However, the backend still needs to find where the files are physically located at submission. Since dependencies are relative to filename, it can use both pieces of information to determine the location of each file.
The outputs don’t exist at submission time and need to remain as the path explicitly specified in properties so that the files can be found in the container when generated.
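The dependency lookup described above could look something like the following minimal sketch. It is not the backend's actual code: the helper name, its arguments, and the example paths are hypothetical. It assumes filename has already been rebased to the workspace and that each dependency is a plain relative path.

```python
import os


def locate_dependencies(workspace_dir: str, filename: str, dependencies: list) -> dict:
    """Map each dependency, as written in the properties, to its location on disk.

    Hypothetical helper. The keys are kept untouched because that is the
    path the file must have inside the container; the values are where the
    backend would look for the file at submission time.
    """
    node_dir = os.path.dirname(os.path.join(workspace_dir, filename))
    return {
        dep: os.path.normpath(os.path.join(node_dir, dep))
        for dep in dependencies
    }


located = locate_dependencies(
    "/ws", "project/notebooks/load.ipynb", ["data/input.csv"]
)
print(located)
```

Keeping the original string as the key and the resolved location as the value reflects why the property itself is never rewritten: the two paths serve different purposes.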
The Problem
With the introduction of custom components, an issue has begun to surface: there is no way to tell when to adjust a path. Custom components need to be able to work out of the box. This means we can’t rely on additional metadata that says “x” properties are special and should be handled in a different way. With KFP components, types are optional. This means we can’t be sure if something is a path or not. Also, I believe @ptitzler mentioned that everything can always be a path, but it could also be another type that is saved to a file and passed as a path to the node.
This component expects an input Text. This input will be passed to the node as a path to the file containing the Text to filter. However, the data passed to this node could either be the path to a file OR the raw text (that will be magically transformed into a file and passed as a path).
I’m not sure if I’m interpreting this correctly, but it seems similar to what they describe here: https://notebook.community/kubeflow/kfp-tekton-backend/samples/tutorials/Data passing in python components
If this is correct, it is dangerous to assume something is a path and try to modify it on the frontend before submission. However, if we don’t modify it, the backend might only work by chance (if the pipeline file happens to be located at the root of the workspace directory).
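To make the risk concrete, here is a small sketch of what blind rebasing would do to a property whose value might be either a path or raw text. The `rebase` helper and the directory names are hypothetical, invented for this illustration only.

```python
import os


def rebase(value: str, pipeline_dir: str, workspace_dir: str) -> str:
    """Blindly treat value as a pipeline-relative path (hypothetical helper)."""
    absolute = os.path.normpath(os.path.join(pipeline_dir, value))
    return os.path.relpath(absolute, start=workspace_dir)


workspace = "/ws"
pipeline_dir = "/ws/project/pipelines"

# The value really is a path: rebasing gives the backend something it can find.
as_path = rebase("data/words.txt", pipeline_dir, workspace)

# The value is raw text: rebasing silently corrupts it.
as_text = rebase("hello world", pipeline_dir, workspace)

# Rebasing changes nothing only when the pipeline sits at the workspace root,
# which is the "works by chance" case if we skip the rewrite entirely.
at_root = rebase("data/words.txt", workspace, workspace)

print(as_path, as_text, at_root, sep="\n")
```

Without knowing whether a value is a path, neither choice is safe: rewriting damages raw text, and skipping the rewrite breaks real paths whenever the pipeline is not at the workspace root.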
Top GitHub Comments
After some discussion, we have come up with a possible solution: introducing “property types”. Each property could ALWAYS be either a file, raw data, or output from another node (the type of property would be determined by the user, not the component). The UI would display a special input control for each property that would let the user choose to use a file picker, enter raw data manually (this would still show things like a checkbox for a boolean type), or choose an output coming from an ancestor node.
The short term solution will be to use inputPath/inputValue to determine if a property will be a “file”, NOT type. We can then safely transform the path to be relative to the workspace during submission.
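As a rough sketch of that short-term idea, the placeholders in a KFP component's container command and args could be scanned to decide which inputs are files. The function name and the example spec below are made up for illustration, and the spec is written as an already-parsed dictionary to keep the example free of a YAML dependency. Only the `inputPath`, `inputValue`, and `outputPath` placeholder keys come from the KFP component format.

```python
def classify_inputs(component_spec: dict) -> dict:
    """Return {input name: "file" or "value"} based on the placeholders used.

    Hypothetical helper: an input referenced through inputPath is treated as
    a file, one referenced through inputValue as a plain value. Declared
    types are ignored on purpose, since they are optional.
    """
    container = component_spec["implementation"]["container"]
    kinds = {}

    def walk(item):
        if isinstance(item, dict):
            for key, value in item.items():
                if key in ("inputPath", "inputValue") and isinstance(value, str):
                    kinds[value] = "file" if key == "inputPath" else "value"
                else:
                    walk(value)  # placeholders may be nested in other structures
        elif isinstance(item, list):
            for element in item:
                walk(element)

    walk(container.get("command", []))
    walk(container.get("args", []))
    return kinds


# Illustrative spec only, not taken from a real component definition.
spec = {
    "inputs": [{"name": "Text"}, {"name": "Pattern"}],
    "implementation": {
        "container": {
            "image": "example/image",
            "command": ["sh", "-c"],
            "args": [
                "--text", {"inputPath": "Text"},
                "--pattern", {"inputValue": "Pattern"},
                "--filtered", {"outputPath": "Filtered text"},
            ],
        }
    },
}
kinds = classify_inputs(spec)
print(kinds)
```

Only the inputs classified as "file" would then be candidates for the workspace-relative rewrite during submission.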
As stated in the closed PR:
Based on discussions with @akchinSTC in a Webex this PR becomes unnecessary based on a new understanding of kfp components as found during work on #1761
In particular we have found that kfp components will never have “representative” files, they will always get a file from a previous node, and nodes can only “represent” files found in locations like s3, not locally.