Add support for Github-based component registries


Is your feature request related to a problem? Please describe.

#2083 set the groundwork for supporting component registries that include multiple components from a single source type (e.g. a directory containing .yaml or .py files). We would also like to support registries that contain component definitions from a GitHub repo.

Describe the solution you’d like

Build out support for searching through a GitHub repo for component definitions.

Considerations

We will need to figure out how to distinguish files that are component definitions from files of the same type that are not.

This article may be of use in designing a solution. We may also want to consider using the GitHub API.

Design

The component registry structure has already laid the groundwork for GitHub-based repos: it includes the GitHubComponentReader class, which derives from UrlComponentReader. Only one class method would need to be updated: get_absolute_locations(). Each Reader class has such a method to break potentially multi-valued locations down into their constituent parts. For the GitHub reader, this method will take the list of paths to GitHub repo(s) given in the registry instance metadata and return a list of paths to each component specification file within that registry.

I believe the lightest-weight implementation of this might include a single call to the GitHub API, specifically of the format:

https://api.github.com/repos/[owner_name]/[repo_name]/contents

Here is the response from the call to a sample component registry repo with 2 component definitions:

[
  {
    "name": "pig_operator.py",
    "path": "pig_operator.py",
    "sha": "499161d1fac3df3f36743630c7799ba4a6aeb250",
    "size": 2707,
    "url": "https://api.github.com/repos/kiersten-stokes/component-registries-airflow/contents/pig_operator.py?ref=main",
    "html_url": "https://github.com/kiersten-stokes/component-registries-airflow/blob/main/pig_operator.py",
    "git_url": "https://api.github.com/repos/kiersten-stokes/component-registries-airflow/git/blobs/499161d1fac3df3f36743630c7799ba4a6aeb250",
    "download_url": "https://raw.githubusercontent.com/kiersten-stokes/component-registries-airflow/main/pig_operator.py",
    "type": "file",
    "_links": {
      "self": "https://api.github.com/repos/kiersten-stokes/component-registries-airflow/contents/pig_operator.py?ref=main",
      "git": "https://api.github.com/repos/kiersten-stokes/component-registries-airflow/git/blobs/499161d1fac3df3f36743630c7799ba4a6aeb250",
      "html": "https://github.com/kiersten-stokes/component-registries-airflow/blob/main/pig_operator.py"
    }
  },
  {
    "name": "sqllite_operator.py",
    "path": "sqllite_operator.py",
    "sha": "fb4a30e350250359d357fed87525fb6d167b756b",
    "size": 2037,
    "url": "https://api.github.com/repos/kiersten-stokes/component-registries-airflow/contents/sqllite_operator.py?ref=main",
    "html_url": "https://github.com/kiersten-stokes/component-registries-airflow/blob/main/sqllite_operator.py",
    "git_url": "https://api.github.com/repos/kiersten-stokes/component-registries-airflow/git/blobs/fb4a30e350250359d357fed87525fb6d167b756b",
    "download_url": "https://raw.githubusercontent.com/kiersten-stokes/component-registries-airflow/main/sqllite_operator.py",
    "type": "file",
    "_links": {
      "self": "https://api.github.com/repos/kiersten-stokes/component-registries-airflow/contents/sqllite_operator.py?ref=main",
      "git": "https://api.github.com/repos/kiersten-stokes/component-registries-airflow/git/blobs/fb4a30e350250359d357fed87525fb6d167b756b",
      "html": "https://github.com/kiersten-stokes/component-registries-airflow/blob/main/sqllite_operator.py"
    }
  }
]

The download_url value is what is of interest to us. As with the directory-based registries, only files with the correct file extension for that type of runtime processor (.py for Airflow and .yaml for KFP currently) will be considered. As usual, any files that cannot be successfully parsed for one reason or another are logged and skipped (outside of the get_absolute_locations method).
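As a minimal sketch of that filtering step (not the actual Elyra implementation), the function below takes an already-parsed contents listing and keeps the download_url of each file whose extension matches the runtime. The function name, the EXTENSIONS mapping and the README.md entry are hypothetical and exist only for illustration.

```python
from typing import Dict, List

# Extensions per runtime processor, as stated in the issue text.
# The mapping name and its keys are assumptions made for this sketch.
EXTENSIONS = {"airflow": (".py",), "kfp": (".yaml",)}


def select_download_urls(contents: List[Dict], runtime: str) -> List[str]:
    """Return download_url values for files matching the runtime's extensions."""
    allowed = EXTENSIONS[runtime]
    return [
        entry["download_url"]
        for entry in contents
        if entry.get("type") == "file"
        and entry.get("download_url")
        and entry["name"].lower().endswith(allowed)
    ]


# Trimmed-down listing: the first entry mirrors the sample response,
# the second is a made-up non-component file.
sample = [
    {
        "name": "pig_operator.py",
        "type": "file",
        "download_url": "https://raw.githubusercontent.com/kiersten-stokes/component-registries-airflow/main/pig_operator.py",
    },
    {
        "name": "README.md",
        "type": "file",
        "download_url": "https://example.invalid/README.md",
    },
]

airflow_urls = select_download_urls(sample, "airflow")
```

Filtering by extension only narrows the candidates; files that pass the filter but fail to parse would still be logged and skipped later, as described above.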

Limitations:

  • Requires the repos in question to be public
    • i.e. supporting private repos would necessitate adding authentication metadata fields to each repo path given in the paths array of the registry
    • This would be nice to support eventually, but won’t make 3.2.0
  • Requires the user to enter the correct repo URL(s) (e.g., https://github.com/[owner_name]/[repo_name])
    • This isn’t any different from our current requirements for other URL-based registries though
  • This doesn’t strictly require that the repo contain only component specification files, but we will definitely want to test that any non-component files picked up for parsing are caught as early as possible, so that we skip the bulk of the parse and don’t throw errors
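To illustrate the repo URL requirement, here is a small sketch of mapping a user-supplied repo URL to the contents endpoint. The function name and the validation rules are assumptions for this example, not part of the existing reader classes.

```python
from urllib.parse import urlparse


def contents_api_url(repo_url: str) -> str:
    """Map https://github.com/<owner>/<repo> to the GitHub contents API URL."""
    parsed = urlparse(repo_url)
    if parsed.netloc.lower() != "github.com":
        raise ValueError(f"Not a github.com URL: {repo_url}")

    # Expect exactly two path segments: owner and repo name.
    parts = [part for part in parsed.path.split("/") if part]
    if len(parts) != 2:
        raise ValueError(f"Expected https://github.com/<owner>/<repo>: {repo_url}")

    owner, repo = parts
    if repo.endswith(".git"):
        repo = repo[: -len(".git")]  # tolerate clone-style URLs
    return f"https://api.github.com/repos/{owner}/{repo}/contents"


api_url = contents_api_url(
    "https://github.com/kiersten-stokes/component-registries-airflow"
)
```

Rejecting malformed URLs up front means a bad path entry fails with a clear message instead of an opaque HTTP error from the API.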

I’m open to other ideas for looping through repo files to get content to parse. I think the API makes a lot of sense because it’s an easy implementation (only one request per path entry) and keeps things url-based as they should be for a remote resource location. Based on my cursory research, I also don’t think other methods would alleviate the limitations cited above for this method.

Questions:

  • I’m assuming we will also want to check subdirectories for yamls/operators, as opposed to requiring a flat structure? This shouldn’t be too difficult because the value of the type key in the GitHub API response is dir for any folders
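A rough sketch of what the subdirectory walk could look like, assuming the contents API marks folders with type dir and gives each entry a url that can be requested for its own listing. The fetch callable is injected so the example runs without network access; collect_files and the fake tree are hypothetical.

```python
from typing import Callable, Dict, List


def collect_files(
    api_url: str,
    fetch: Callable[[str], List[Dict]],
    extension: str,
) -> List[str]:
    """Walk a contents listing recursively and return matching download URLs."""
    found: List[str] = []
    for entry in fetch(api_url):
        if entry["type"] == "dir":
            # A directory entry's url points at that directory's own listing.
            found.extend(collect_files(entry["url"], fetch, extension))
        elif entry["type"] == "file" and entry["name"].endswith(extension):
            found.append(entry["download_url"])
    return found


# Fake in-memory "API" standing in for real HTTP responses (made-up data).
FAKE_TREE = {
    "root": [
        {"name": "a.py", "type": "file", "url": "root/a.py", "download_url": "raw/a.py"},
        {"name": "sub", "type": "dir", "url": "root/sub", "download_url": None},
        {"name": "notes.txt", "type": "file", "url": "root/notes.txt", "download_url": "raw/notes.txt"},
    ],
    "root/sub": [
        {"name": "b.py", "type": "file", "url": "root/sub/b.py", "download_url": "raw/sub/b.py"},
    ],
}

urls = collect_files("root", FAKE_TREE.__getitem__, ".py")
```

Note that walking subdirectories this way costs one request per directory rather than one per path entry, which matters for unauthenticated API rate limits.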

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 9 (4 by maintainers)

Top GitHub Comments

1 reaction
ptitzler commented, Sep 22, 2021

Since users can already specify public web resources as a source, non-bulk load scenarios are already somewhat supported. Therefore it might be better to defer until we have a better understanding of how (KFP/AA) users currently manage component specifications.

0 reactions
Ark-kun commented, Jun 6, 2022

I have raised https://github.com/kubeflow/pipelines/issues/7832, to propose that KFP natively adds component_id and component_version to the Component YAML spec, if you want to comment there.

TBH, I’m not sure whether this would be a measurable improvement over having this information in annotations. The idea of annotations is to provide a pathway for extensibility while maintaining backward and forward compatibility. It can be used as an experimental playground while the tools are being tested.

Note how canonical_location was added without changing the ComponentSpec schema and without breaking old or new users. Additionally, I’m not fully sure how component_version would be handled in a world where a component file can be forked and changed. E.g. what happens when someone forks the component and makes some change, but does not update the version? What if they increase the version a lot? This is why I’ve used the canonical_location wording. The canonical location points to a repo and branch where the whole component lineage can be discovered and the latest version can be obtained. It also allows changing the “component ID”: the latest component yaml file will have canonical_location pointing to another directory/repo.
