
[Proposal] Introduce the notion of Runtime Processor Type


This issue presents a proposal for introducing the notion of a Runtime Processor Type so that implementations of the same runtime can be more easily distinguished.

Definition: The following uses the term runtime processor. This term refers to the platform or orchestration tool that drives the execution of a given pipeline. Support for two common runtime processors, Apache Airflow and Kubeflow, is embedded in Elyra; others exist outside of Elyra and could be implemented using our BYO model.

Problem: With the ability to bring your own runtimes and, as of #2241, bring your own component catalog connectors, it is important that we have the ability to specify that a given entity supports a type of runtime processor. Today, Elyra only defines runtime processor names. Although each instance of a PipelineProcessor has a type property, that property value is actually the name of a runtime configuration schema, not a type of runtime processor.

For example, Elyra ships with two runtime schema definitions - airflow and kfp. Runtime configuration instances of these schemas can be created, but each schema equates to a specific RuntimePipelineProcessor (or RPP) implementation (which is a subclass of the aforementioned PipelineProcessor class). However, if someone wanted to bring their own implementation of RuntimePipelineProcessor that also used Kubeflow to drive the execution of the pipeline, there really isn’t a way for that implementation to indicate that it, too, is a Kubeflow-based processor, similar to the processor named kfp.

Likewise, Component Catalog Connectors (or CCCs) want the ability to state that the components served from their implementation support a particular processor type, like Kubeflow or Apache Airflow, irrespective of how many RPP implementations are registered.

As a result, we need to formally introduce the notion of a runtime processor type.

Proposal: The first issue in introducing a runtime processor type is its general conflict with the bring-your-own paradigm. How can we possibly enumerate types - which must be known at development time - when we don’t know what runtime processor implementations we will have until run time? This can be solved by first defining the set of known processor types, irrespective of there being an implementation of that type. This must be clearly conveyed to users: just because a specific processor type is defined does NOT imply that such an implementation is available or that one even exists.

A quick google search for runtime processor types yields this result. Here we find six task/orchestration tools, including the three we have some knowledge about (Apache Airflow, Kubeflow, and Argo), with the others being Luigi, Prefect, and MLFlow.

We can add some of these as development-time processor types, with the idea that anyone wishing to introduce a runtime processor implementation (i.e., a BYO RPP) of an undefined (unlisted) type must first open a pull request adding that type to the enumeration.

Implementation (high-level): We’d like a central location that houses the various processor types. The type names should have string representations that can be referenced in schemas and the UI. Ideally, simple comparisons could be made without regard to case. It should be easy for users to introduce a new type prior to implementing their Pipeline Processor or Catalog Connector. Python’s Enum construct seems like a nice fit. In particular, we should use the @unique decorator to ensure values are unique.

Using the types referenced in the google search result, we’d have an enum like the following:

from enum import Enum, unique

@unique
class RuntimeProcessorType(Enum):
    Local = 1
    Kubeflow = 2
    ApacheAirflow = 4
    Argo = 8
    Luigi = 16
    Prefect = 32
    MLFlow = 64

There are a few items worth noting.

  1. The integer values correspond to different bits of an integer. I’m not entirely sure this is necessary, but it would allow a set of processor types to be stored in a single integer, with simple bit manipulation applied to determine membership (see the sketch after this list).
  2. I’ve added Local as a type - which I think we want to do - even though it’s not an official processor type in Elyra. In addition, the nice thing about using bits is that Local having a value of 1 would be the only odd value - which is somewhat fitting 😄, but also unnecessary.
  3. The downside to using bits is that we’d be limited to 31 task/orchestration engines, so it may be unnecessarily limiting, although, practically speaking, probably not an issue.

We could discuss this further; I don’t really have an affinity for the particular integer values.
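
For illustration, here is a minimal sketch of the bit-membership idea, assuming the enum above; the supports() helper is hypothetical, not part of the proposal:

from enum import Enum, unique

@unique
class RuntimeProcessorType(Enum):
    Local = 1
    Kubeflow = 2
    ApacheAirflow = 4
    Argo = 8

# Compose a set of supported processor types into a single integer mask...
supported_mask = RuntimeProcessorType.Kubeflow.value | RuntimeProcessorType.ApacheAirflow.value

# ...and test membership with simple bit manipulation.
def supports(mask: int, processor_type: RuntimeProcessorType) -> bool:
    return bool(mask & processor_type.value)

assert supports(supported_mask, RuntimeProcessorType.ApacheAirflow)
assert not supports(supported_mask, RuntimeProcessorType.Local)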

Enum classes also have a built-in dictionary that maps each member’s name to the enum instance. This dictionary is __members__, so, using the definition above, RuntimeProcessorType.__members__.get('Kubeflow') will return RuntimeProcessorType.Kubeflow. We will certainly wrap __members__ access in a get_instance() kind of method.

The string name is accessible via a built-in name property, so RuntimeProcessorType.Kubeflow.name yields 'Kubeflow'. Likewise, there’s a built-in value property where RuntimeProcessorType.Kubeflow.value yields 2. As a result, I think using an Enum subclass would give us the flexibility and central location we (and our users) need.
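
As a sketch, the get_instance() wrapper could perform the case-insensitive lookup over __members__ mentioned above (the method name here is illustrative):

from enum import Enum, unique

@unique
class RuntimeProcessorType(Enum):
    Local = 1
    Kubeflow = 2
    ApacheAirflow = 4

    @staticmethod
    def get_instance(name: str) -> "RuntimeProcessorType":
        # Case-insensitive lookup over the built-in __members__ mapping.
        for member_name, member in RuntimeProcessorType.__members__.items():
            if member_name.lower() == name.lower():
                return member
        raise KeyError(f"Invalid runtime processor type: '{name}'")

assert RuntimeProcessorType.get_instance("kubeflow") is RuntimeProcessorType.Kubeflow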

Schemas of the Runtimes schemaspace will require a “type” property. This property will be a constant, because each runtime schema is associated with exactly one processor type. This will require a migration, but that can be performed easily by introducing metadata_class_name values for each of our runtime schemas. When a given instance is loaded, the class implementation will check whether a processor_type field exists and, if not, inject that field, persist the update, then return from the load. (We should introduce a version field at this time as well.) This same migration approach is used in the Catalog Connector PR (#2241).
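
A minimal, self-contained sketch of that load-time migration; all names here are hypothetical stand-ins, not Elyra’s actual metadata API:

from typing import Callable

# Each runtime schema name maps one-to-one to a processor type name.
SCHEMA_TO_TYPE = {"airflow": "APACHE_AIRFLOW", "kfp": "KUBEFLOW_PIPELINES"}

def migrate_on_load(instance: dict, schema_name: str, persist: Callable[[dict], None]) -> dict:
    # Inject the missing runtime_type (and version) fields, persist, then return.
    metadata = instance.setdefault("metadata", {})
    if "runtime_type" not in metadata:
        metadata["runtime_type"] = SCHEMA_TO_TYPE[schema_name]
        metadata.setdefault("version", 1)  # introduce a version field at the same time
        persist(instance)  # stand-in for the metadata service's update call
    return instance

# e.g., migrate_on_load(loaded_instance, "airflow", persist=metadata_store_save)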

Users implementing their own Catalog Connectors should explicitly list the processor types they support. We could introduce the notion of a wildcard value (e.g., '*') to indicate any processor type is supported, but, given the potential plethora of task/orchestration engines, I think it would be best to be explicit. When those providers add support for another engine, they simply expose an updated schema whose enum-valued property contains the new reference. Likewise, they may choose to drop support for a given processor type.

There are locations within the server and UI where the processor name is used today. These will need to be updated and replaced with formal type-based names. In addition, we’ll want a new endpoint that can be used to retrieve the types for any registered runtime processors. For example, if there are two Kubeflow processor implementations registered (and nothing more), this endpoint would return Kubeflow (corresponding to the name property of RuntimeProcessorType.Kubeflow) and not the names of the two registered processors (i.e., schemas).
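
The de-duplication such an endpoint performs could be as simple as the following sketch (the processor objects and schema names shown are assumptions, not actual Elyra code):

from dataclasses import dataclass
from enum import Enum, unique

@unique
class RuntimeProcessorType(Enum):
    Kubeflow = 2
    ApacheAirflow = 4

@dataclass
class RegisteredProcessor:
    name: str                    # the schema name, e.g. "kfp"
    type: RuntimeProcessorType   # the processor's runtime type

def registered_runtime_types(processors):
    # Report the distinct types, not the individual registered processors.
    return sorted({p.type.name for p in processors})

procs = [RegisteredProcessor("kfp", RuntimeProcessorType.Kubeflow),
         RegisteredProcessor("kfp-custom", RuntimeProcessorType.Kubeflow)]
assert registered_runtime_types(procs) == ["Kubeflow"]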

Alternative approach: Rather than introduce a RuntimeProcessorType enumeration, we could introduce type-based base classes that derive from RuntimePipelineProcessor. For example, an ApacheAirflowBase could be inserted between RuntimePipelineProcessor and AirflowPipelineProcessor. This would provide a home for Airflow-specific functionality that is agnostic to the actual implementations. These base classes would then expose a processor_type property that reflects their type. In addition, code could use isinstance(implementation_instance, ApacheAirflowBase) to determine “type”. The problem with this is that we’d still want to introduce “empty” implementations for future-supported types, even though one may never exist. This seems a little heavyweight. Another caveat is that the “types” would be scattered about rather than in a central location like an Enum class. As a result, references to multiple types would require imports of the various implementations - which is way too heavyweight.
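
A sketch of this (rejected) alternative, with placeholder class bodies:

class RuntimePipelineProcessor:
    """Placeholder for the existing base class."""

class ApacheAirflowBase(RuntimePipelineProcessor):
    """Home for Airflow-specific, implementation-agnostic functionality."""
    @property
    def processor_type(self) -> str:
        return "APACHE_AIRFLOW"

class AirflowPipelineProcessor(ApacheAirflowBase):
    """Placeholder for the concrete implementation."""

impl = AirflowPipelineProcessor()
assert isinstance(impl, ApacheAirflowBase)       # "type" check via isinstance
assert impl.processor_type == "APACHE_AIRFLOW"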

Front-end/pipeline changes related to this proposal:

  1. The runtime platform icon/tile to display should be predicated on the runtime_type field from the schema within the Runtimes schemaspace. Currently, this determination is based on the name of the schema (kfp, airflow). If the runtime_type specifies KUBEFLOW_PIPELINES, the Kubeflow icon is displayed, etc. The schema title should be used as the icon ‘name’ or hover information. The name may also serve that purpose.
  2. When a tile is selected to create a new pipeline, the pipeline contents should include both a runtime: property (which equates to the schema name, as is the case today) and a runtime_type: (sibling) property which reflects the schema’s runtime_type value.
  3. Pipeline files (today) contain the following information:
      "app_data": {
        "ui_data": {
          "comments": []
        },
        "version": 5,
        "runtime": "airflow",
        "properties": {
          "name": "untitled13",
          "runtime": "Apache Airflow"
        }
      },

It is unclear what the second (embedded) runtime value is used for, or why it is located in a sub-object. We should probably reformat this as:

      "app_data": {
        "ui_data": {
          "comments": []
        },
        "version": 5,
        "runtime": "airflow",
        "runtime_type": "APACHE_AIRFLOW",
        "name": "untitled13"
      },

barring reasons for the sub-object. Do other items get placed in the "properties": sub-object? Also, note the use of the enumerated type’s name field rather than the displayable value. Wherever items are persisted (as in the schema runtime_type field as well), we want to use the name so that a level of indirection is introduced for obtaining the values. This enables the displayable values to change whenever necessary.

  4. Migration will need to infer the runtime_type value from the schema name - which should be one-to-one.
  5. I think we will want a uihint that can convert the type name to its displayable value. For example, consider this portion of the catalog connector schema…

        "runtime_type": {
        "title": "Runtime Processor Type",
        "description": "The runtime type associated with this Component Catalog",
        "type": "string",
        "enum": ["KUBEFLOW_PIPELINES", "APACHE_AIRFLOW"],
        "uihints": {
          "field_type": "dropdown",
          "category": "Runtime",
          "value-map": {"KUBEFLOW_PIPELINES":"Kubeflow Pipelines", "APACHE_AIRFLOW":"Apache Airflow"}
        }
      },

it would be nice if the editor could look up the up-cased values in a value-map to populate the dropdown and, similarly, set the up-cased value into the field once a selection is made. That is, the “displayable” value would only be used for display. We can get by with just the up-cased values, but it’s a better UX if we can display the displayable values.
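
The round trip the editor would perform amounts to a simple two-way mapping; a sketch in Python terms, mirroring the value-map from the schema fragment above:

# Forward map: persisted name -> displayable value (from the schema's uihints).
value_map = {"KUBEFLOW_PIPELINES": "Kubeflow Pipelines", "APACHE_AIRFLOW": "Apache Airflow"}
# Reverse map: displayable value -> persisted name (applied when a selection is made).
reverse_map = {display: name for name, display in value_map.items()}

assert value_map["APACHE_AIRFLOW"] == "Apache Airflow"    # shown in the dropdown
assert reverse_map["Apache Airflow"] == "APACHE_AIRFLOW"  # written to the field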

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

bourdakos1 commented, Nov 3, 2021 (3 reactions)


Today we discussed persisting runtime type vs runtime processor information in the pipeline. We decided that eventually it would be best to only persist one or the other, but due to time constraints we will persist everything for now.

We were a bit unsure about which value to persist, but are leaning towards only persisting type. There are a couple of trade-offs and assumptions made depending on which we choose.

persisting type:

  • assumes any pipeline of type APACHE_AIRFLOW can run with any processor (airflow, airflow-no-cos)
    • allows better portability, a user can still run the pipeline with the airflow processor if they don’t have airflow-no-cos installed
  • assumes the set of components available is fully dependent on type (APACHE_AIRFLOW)
  • only one tile APACHE_AIRFLOW available when creating pipeline
  • processor (airflow, airflow-no-cos) and config are chosen at submission time
    • note: processor doesn’t need to be explicitly chosen at submission (it could just show all three configs available)

persisting processor:

  • assumes a pipeline MUST run with persisted processor (airflow, airflow-no-cos)
  • component list could theoretically change depending on processor
    • airflow and airflow-no-cos could have different components available
  • two tiles (airflow and airflow-no-cos) available when creating pipeline
  • only config is chosen at submission time

lresende commented, Oct 28, 2021 (2 reactions)

> Definition: The following uses the term runtime processor. This term refers to the platform or orchestration tool that drives the execution of a given pipeline. Support for two common runtime processors, Apache Airflow and Kubeflow, is embedded in Elyra; others exist outside of Elyra and could be implemented using our BYO model.

I view the following as the drivers of runtime association:

  • Pipeline should have a “human understandable” value to define a runtime
    • This enables easy troubleshooting and the ability to read/update the pipeline file if necessary
  • The value on the pipeline is used to drive the discovery of which runtime processor to use
  • The processor serves as a facade/factory for other things related to a runtime
    • e.g. what catalog to use, etc

> Problem: With the ability to bring your own runtimes and, as of #2241, bring your own component catalog connectors, it is important that we have the ability to specify that a given entity supports a type of runtime processor. Today, Elyra only defines runtime processor names. Although each instance of a PipelineProcessor has a type property, that property value is actually the name of a runtime configuration schema, not a type of runtime processor.
>
> For example, Elyra ships with two runtime schema definitions - airflow and kfp. Runtime configuration instances of these schemas can be created, but each schema equates to a specific RuntimePipelineProcessor (or RPP) implementation (which is a subclass of the aforementioned PipelineProcessor class). However, if someone wanted to bring their own implementation of RuntimePipelineProcessor that also used Kubeflow to drive the execution of the pipeline, there really isn’t a way for that implementation to indicate that it, too, is a Kubeflow-based processor, similar to the processor named kfp.

I believe that most, if not all, deployments will focus on one runtime. In case there are multiple runtime processors that support a given runtime, I would focus on solving the problem via #2136 and only installing the desired runtime processor. Note that this would also enable users to continue to use the existing catalogs, etc., as they shouldn’t be impacted by a different implementation of a “kfp” runtime processor.

> Likewise, Component Catalog Connectors (or CCCs) want the ability to state that the components served from their implementation support a particular processor type, like Kubeflow or Apache Airflow, irrespective of how many RPP implementations are registered.

Agree on the Component Catalog Connector parts; they are associated with a given runtime and are not necessarily different if there are multiple runtime processor implementations available for a given runtime.

Having said that, I don’t think we should support ever having more than one live implementation on a deployment.

> As a result, we need to formally introduce the notion of a runtime processor type.
>
> Proposal: The first issue in introducing a runtime processor type is its general conflict with the bring-your-own paradigm. How can we possibly enumerate types - which must be known at development time - when we don’t know what runtime processor implementations we will have until run time? This can be solved by first defining the set of known processor types, irrespective of there being an implementation of that type. This must be clearly conveyed to users: just because a specific processor type is defined does NOT imply that such an implementation is available or that one even exists.
>
> A quick google search for runtime processor types yields this result. Here we find six task/orchestration tools, including the three we have some knowledge about (Apache Airflow, Kubeflow, and Argo), with the others being Luigi, Prefect, and MLFlow.
>
> We can add some of these as development-time processor types, with the idea that anyone wishing to introduce a runtime processor implementation (i.e., a BYO RPP) of an undefined (unlisted) type must first open a pull request adding that type to the enumeration.
>
> Implementation (high-level): We’d like a central location that houses the various processor types. The type names should have string representations that can be referenced in schemas and the UI. Ideally, simple comparisons could be made without regard to case. It should be easy for users to introduce a new type prior to implementing their Pipeline Processor or Catalog Connector. Python’s Enum construct seems like a nice fit. In particular, we should use the @unique decorator to ensure values are unique.
>
> Using the types referenced in the google search result, we’d have an enum like the following:
>
> @unique
> class RuntimeProcessorType(Enum):
>     Local = 1
>     Kubeflow = 2
>     ApacheAirflow = 4
>     Argo = 8
>     Luigi = 16
>     Prefect = 32
>     MLFlow = 64

Opening a pipeline and seeing runtime=2 will not be very user-friendly and will require people to go look at the docs to figure out what runtime it maps to.

I also don’t like that people would need to change the code to introduce new runtimes (e.g., the proposed list already misses Tekton, Flyte, etc.). This would be an issue if people have proprietary ones as well.

I also think that our UI is probably not ready to support multiple runtimes of the same type.
