Filepath handling for generic and custom components

Please let me know if there are any mistakes or misunderstandings, or if I’m missing anything.

Before the introduction of custom components, there were 3 properties that could contain file paths:

filename
dependencies
outputs

To discuss the subtleties in file path handling I will define several scenarios:

persistence: Anytime the pipeline contents are sitting at rest in a file
UI time: When the pipeline is loaded into the UI (basically runtime, but I want to avoid further overloading that word)
submission API: When submitting a pipeline for export or running via the submit/export button in the UI
CLI: When interacting with a pipeline file via the CLI

When a pipeline file is persisted:

filename is the relative path to the file from the pipeline
dependencies and outputs are the relative paths to the file from the above stated filename property

At UI time these properties are unaltered:

filename is the relative path to the file from the pipeline
dependencies and outputs are the relative paths to the file from the above stated filename property

When using the submission API, a copy of the pipeline is made and its paths are updated. This copy does not make its way back to UI time and is not persisted; it is only sent as a request to the server.

filename is the path to the file relative to the workspace (the directory where JupyterLab was started)
dependencies and outputs are the relative paths to the file from the above stated filename property
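
A minimal sketch of that submission-time copy, not Elyra's actual implementation. It assumes a simplified, hypothetical pipeline shape (a flat `nodes` list) and a hypothetical helper name `prepare_for_submission`; the real pipeline JSON is more deeply nested. `posixpath` is used so the result is the same on every platform.

```python
import copy
import posixpath


def prepare_for_submission(pipeline: dict, pipeline_path: str) -> dict:
    """Return a copy whose `filename` values are relative to the workspace.

    `pipeline_path` is the pipeline file's own path relative to the workspace.
    The pipeline shape used here is a simplified assumption.
    """
    submitted = copy.deepcopy(pipeline)  # the loaded pipeline stays untouched
    pipeline_dir = posixpath.dirname(pipeline_path)
    for node in submitted["nodes"]:
        node["filename"] = posixpath.normpath(
            posixpath.join(pipeline_dir, node["filename"])
        )
        # dependencies and outputs are intentionally left as they are
    return submitted


pipeline = {
    "nodes": [
        {
            "filename": "../notebooks/train.ipynb",
            "dependencies": ["data/input.csv"],
            "outputs": ["model.pkl"],
        }
    ]
}
submitted = prepare_for_submission(pipeline, "pipelines/demo.pipeline")
print(submitted["nodes"][0]["filename"])  # notebooks/train.ipynb
print(pipeline["nodes"][0]["filename"])   # ../notebooks/train.ipynb
```

Only the copy carries the workspace-relative path; the pipeline held at UI time keeps its pipeline-relative `filename`.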

When submitting a pipeline with the CLI, the provided pipeline path is used to load the pipeline JSON. A copy of the pipeline is made and its paths are updated. This copy of the pipeline does not get persisted.

filename is the path to the file relative to the directory where the CLI was called
dependencies and outputs are the relative paths to the file from the above stated filename property
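
The CLI case differs only in the base directory. The sketch below is an illustration under assumptions, not the CLI's real code: the function name `filename_relative_to_cwd` is hypothetical, and the calling directory is passed in explicitly (instead of being read from `os.getcwd()`) to keep the example deterministic.

```python
import posixpath


def filename_relative_to_cwd(pipeline_path: str, filename: str, cwd: str) -> str:
    """Rewrite a pipeline-relative `filename` relative to the calling directory.

    `pipeline_path` is the path as typed on the command line, so it may itself
    be relative to `cwd`.
    """
    abs_pipeline = posixpath.normpath(posixpath.join(cwd, pipeline_path))
    abs_file = posixpath.normpath(
        posixpath.join(posixpath.dirname(abs_pipeline), filename)
    )
    return posixpath.relpath(abs_file, start=cwd)


# Called from the project root
print(filename_relative_to_cwd(
    "pipelines/demo.pipeline", "../notebooks/train.ipynb", "/home/user/project"
))  # notebooks/train.ipynb

# Called from inside the pipelines directory
print(filename_relative_to_cwd(
    "demo.pipeline", "../notebooks/train.ipynb", "/home/user/project/pipelines"
))  # ../notebooks/train.ipynb
```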

Why use relative to pipeline when persisted?

Storing paths relative to the pipeline allows for the files to be found no matter where JupyterLab is started from as well as keeping pipelines sharable.

Why are the paths updated to be relative to the workspace directory for submission?

When a pipeline is submitted, the backend isn’t aware of the existence of the pipeline file. The backend only receives a JSON payload of the pipeline contents. This allows pipelines that don’t actually exist on the filesystem to be run. An example of this is running a single notebook as a pipeline: a fake pipeline JSON is created and sent to the server. The drawback, however, is that since all paths are relative to the pipeline, the backend can no longer find the files, because it doesn’t know where the pipeline is (or, in the case of the single notebook run, the pipeline file doesn’t even exist). For that reason, before submission, all paths need to be updated to be relative to the workspace directory (essentially an absolute path in the eyes of JupyterLab) so that the server can find the appropriate files.

Why aren’t `dependencies` and `outputs` updated?

The dependencies need to show up in the container at the path explicitly specified in properties. That is why no modifications can be made at submission time. However, the backend still needs to find where the dependencies are physically located at submission. Since dependencies are relative to filename, it can use both pieces of information to determine the location of each file. The outputs don’t exist at submission time and need to remain as the path explicitly specified in properties so that the files can be found in the container when generated.
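
As a rough illustration of how the backend can combine the two properties, the sketch below maps each dependency as declared (the path it must have in the container) to its location on disk relative to the workspace. The helper name `locate_dependencies` and the sample paths are assumptions made for the example, and it reads "relative to filename" as "relative to the directory containing that file".

```python
import posixpath


def locate_dependencies(filename: str, dependencies: list[str]) -> dict[str, str]:
    """Map each declared dependency path to its workspace-relative source path.

    `filename` is assumed to already be relative to the workspace, as it is
    in the submitted copy of the pipeline.
    """
    file_dir = posixpath.dirname(filename)
    return {
        dep: posixpath.normpath(posixpath.join(file_dir, dep))
        for dep in dependencies
    }


sources = locate_dependencies(
    "notebooks/train.ipynb", ["data/input.csv", "utils.py"]
)
for declared, on_disk in sources.items():
    print(f"{declared} <- {on_disk}")
```

The declared value is never rewritten; only the lookup location is derived from it.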

The Problem

With the introduction of custom components, an issue has begun to surface: there is no way to tell when to adjust a path. Custom components need to be able to work out of the box. This means we can’t rely on additional metadata that says “x” properties are special and should be handled in a different way. With KFP components, types are optional. This means we can’t be sure if something is a path or not. Also, I believe @ptitzler mentioned that everything can always be a path, but it could also be another type that is saved to a file and passed as a path to the node.

For example: https://github.com/elyra-ai/elyra/blob/3a94434db73341526f53b425ee4fed8d87a9dce6/etc/config/components/kfp/filter_text_using_shell_and_grep.yaml

This component expects an input Text. This input will be passed to the node as a path to the file containing the Text to filter. However, the data passed to this node could be either the path to a file OR the raw text (which will be magically transformed into a file and passed as a path).

I’m not sure if I’m interpreting this correctly, but it seems similar to what they describe here: https://notebook.community/kubeflow/kfp-tekton-backend/samples/tutorials/Data passing in python components

If this is correct, it is dangerous to assume something is a path and try to modify it on the frontend before submission. However, if we don’t modify it, the backend might only work by chance (if the pipeline file happens to be located at the root of the workspace directory).
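
Both failure modes can be shown with a small sketch. The helper `to_workspace_relative` and the sample values are hypothetical; the function only mirrors the kind of rewrite applied to `filename` at submission.

```python
import posixpath


def to_workspace_relative(pipeline_path: str, value: str) -> str:
    """Treat `value` as a pipeline-relative path and rebase it on the workspace."""
    return posixpath.normpath(
        posixpath.join(posixpath.dirname(pipeline_path), value)
    )


# Failure mode 1: the value is raw text, but is rewritten as if it were a path.
raw_text = "hello world"
mangled = to_workspace_relative("pipelines/demo.pipeline", raw_text)
print(mangled)  # pipelines/hello world

# Failure mode 2: the value really is a path, but is sent unmodified.
path_value = "data/input.txt"
at_root = to_workspace_relative("demo.pipeline", path_value)
nested = to_workspace_relative("pipelines/demo.pipeline", path_value)
print(at_root == path_value)  # True: the unmodified value happens to be right
print(nested == path_value)   # False: the backend would look in the wrong place
```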


Top GitHub Comments

bourdakos1 commented, Aug 2, 2021

After some discussion, we have come up with a possible solution of introducing “property types”. Each property could ALWAYS either be a file, raw data, or output from another node (the type of property would be determined by the user not the component). The UI would display a special input control for each property that would let the user choose to use a file picker, enter raw data manually (this would still show things like a checkbox for a boolean type), or choose an output coming from an ancestor node.

The short term solution will be to use inputPath/inputValue to determine if a property will be a “file”, NOT type. We can then safely transform the path to be relative to the workspace during submission.
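
A rough sketch of that short-term idea, assuming the component definition has already been parsed into a dictionary: walk the container's `command` and `args` and collect the inputs consumed through an `inputPath` placeholder. The function name `file_inputs` is hypothetical, and the component below is a hand-written, abbreviated stand-in for the grep example linked in the issue, not a copy of it.

```python
def file_inputs(component: dict) -> set[str]:
    """Collect the names of inputs consumed via an `inputPath` placeholder."""
    container = component["implementation"]["container"]
    found: set[str] = set()

    def walk(item) -> None:
        # Placeholders are single-key dicts and may be nested inside lists
        # or other structured placeholders, so recurse through both.
        if isinstance(item, dict):
            if "inputPath" in item:
                found.add(item["inputPath"])
            for value in item.values():
                walk(value)
        elif isinstance(item, list):
            for value in item:
                walk(value)

    walk(container.get("command", []))
    walk(container.get("args", []))
    return found


# Abbreviated, hand-written component definition (an assumption for this example)
component = {
    "name": "Filter text",
    "inputs": [{"name": "Text"}, {"name": "Pattern"}],
    "outputs": [{"name": "Filtered text"}],
    "implementation": {
        "container": {
            "image": "alpine",
            "command": [
                "sh", "-ec", 'grep "$1" < "$0" > "$2"',
                {"inputPath": "Text"},
                {"inputValue": "Pattern"},
                {"outputPath": "Filtered text"},
            ],
        }
    },
}
print(file_inputs(component))  # {'Text'}
```

Under this rule only `Text` would be treated as a file and rebased on the workspace; `Pattern` is passed by value and would be left untouched, regardless of any declared type.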

ajbozarth commented, Sep 2, 2021

As stated in the closed PR:

Based on discussions with @akchinSTC in a Webex this PR becomes unnecessary based on a new understanding of kfp components as found during work on #1761

In particular we have found that kfp components will never have “representative” files, they will always get a file from a previous node, and nodes can only “represent” files found in locations like s3, not locally.
