Add support for GitHub-based component registries
Is your feature request related to a problem? Please describe.
#2083 set the groundwork for supporting component registries that include multiple components from a single source type (e.g. a directory containing `.yaml` or `.py` files). We would also like to support registries that contain component definitions from a GitHub repo.
Describe the solution you'd like
Build out support for searching through a GitHub repo for component definitions.
Considerations
We will need to figure out how to discriminate between files that are component definitions and files (of the same type) that are not.
This article may be of use in designing a solution. We may also want to consider using the GitHub API.
Design
The structure of the component registry has already laid the groundwork to support GitHub-based repos and already includes the `GitHubComponentReader` class, which derives from `UrlComponentReader`. Only one class method would need to be updated: `get_absolute_locations()`. Each `Reader` class has such a method to break potentially multi-valued locations down into their constituent parts. For the GitHub reader, this method will take the list of paths to GitHub repo(s) given in the registry instance metadata and return a list of paths to each component specification file within that registry.
I believe the lightest-weight implementation of this might include a single call to the GitHub API per repo, specifically of the format:
`https://api.github.com/repos/[owner_name]/[repo_name]/contents`
Here is the response from a call to this endpoint for a sample component registry repo with 2 component definitions:
```json
[
  {
    "name": "pig_operator.py",
    "path": "pig_operator.py",
    "sha": "499161d1fac3df3f36743630c7799ba4a6aeb250",
    "size": 2707,
    "url": "https://api.github.com/repos/kiersten-stokes/component-registries-airflow/contents/pig_operator.py?ref=main",
    "html_url": "https://github.com/kiersten-stokes/component-registries-airflow/blob/main/pig_operator.py",
    "git_url": "https://api.github.com/repos/kiersten-stokes/component-registries-airflow/git/blobs/499161d1fac3df3f36743630c7799ba4a6aeb250",
    "download_url": "https://raw.githubusercontent.com/kiersten-stokes/component-registries-airflow/main/pig_operator.py",
    "type": "file",
    "_links": {
      "self": "https://api.github.com/repos/kiersten-stokes/component-registries-airflow/contents/pig_operator.py?ref=main",
      "git": "https://api.github.com/repos/kiersten-stokes/component-registries-airflow/git/blobs/499161d1fac3df3f36743630c7799ba4a6aeb250",
      "html": "https://github.com/kiersten-stokes/component-registries-airflow/blob/main/pig_operator.py"
    }
  },
  {
    "name": "sqllite_operator.py",
    "path": "sqllite_operator.py",
    "sha": "fb4a30e350250359d357fed87525fb6d167b756b",
    "size": 2037,
    "url": "https://api.github.com/repos/kiersten-stokes/component-registries-airflow/contents/sqllite_operator.py?ref=main",
    "html_url": "https://github.com/kiersten-stokes/component-registries-airflow/blob/main/sqllite_operator.py",
    "git_url": "https://api.github.com/repos/kiersten-stokes/component-registries-airflow/git/blobs/fb4a30e350250359d357fed87525fb6d167b756b",
    "download_url": "https://raw.githubusercontent.com/kiersten-stokes/component-registries-airflow/main/sqllite_operator.py",
    "type": "file",
    "_links": {
      "self": "https://api.github.com/repos/kiersten-stokes/component-registries-airflow/contents/sqllite_operator.py?ref=main",
      "git": "https://api.github.com/repos/kiersten-stokes/component-registries-airflow/git/blobs/fb4a30e350250359d357fed87525fb6d167b756b",
      "html": "https://github.com/kiersten-stokes/component-registries-airflow/blob/main/sqllite_operator.py"
    }
  }
]
```
The `download_url` value is what is of interest to us. As with directory-based registries, only files with the correct file extension for that type of runtime processor (currently `.py` for Airflow and `.yaml` for KFP) will be considered. As usual, any files that cannot be successfully parsed for one reason or another are logged and skipped (outside of the `get_absolute_locations` method).
Limitations:
- Requires the repos in question to be public
  - i.e. supporting private repos would necessitate adding authentication metadata fields to each repo path given in the `paths` array of the registry
  - This would be nice to support eventually, but won't make 3.2.0
- Requires that the user enter the correct repo URL(s) (e.g., `https://github.com/[owner_name]/[repo_name]`)
  - This isn't any different from our current requirements for other URL-based registries, though
- Doesn't necessarily require that the repo contain only component specification files, but we will definitely want to test that, if non-component files are picked up for parsing, we catch it as early as possible and skip the bulk of the parse rather than throwing errors
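For the `.yaml` (KFP) case, an early rejection check on that last point might look like the sketch below. The function name is hypothetical, and treating a top-level `implementation` key as the marker of a component spec is an assumption based on the KFP component schema, not a rule taken from Elyra's code.

```python
import yaml  # PyYAML


def looks_like_component_spec(file_contents):
    """Cheaply decide whether a fetched .yaml file is plausibly a KFP
    component definition before attempting the full parse."""
    try:
        spec = yaml.safe_load(file_contents)
    except yaml.YAMLError:
        return False  # not even valid YAML: log and skip
    # Assumption: a KFP component definition is a mapping that contains
    # an 'implementation' section
    return isinstance(spec, dict) and "implementation" in spec
```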
I’m open to other ideas for looping through repo files to get content to parse. I think the API makes a lot of sense because it’s an easy implementation (only one request per path entry) and keeps things url-based as they should be for a remote resource location. Based on my cursory research, I also don’t think other methods would alleviate the limitations cited above for this method.
Questions:
- I'm assuming we will also want to check subdirectories for yamls/operators, as opposed to requiring a flat structure? This shouldn't be too difficult because the `type` key in the GitHub API response will be `dir` for any folders
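If we do recurse into subdirectories, the traversal could be sketched as below. `collect_download_urls` and its injectable `fetch` parameter are hypothetical names; the recursion relies on each `dir` entry's `url` field pointing at that subdirectory's own contents listing, which the contents API provides.

```python
import json
from urllib.request import urlopen


def _fetch_json(url):
    # Default fetcher: GET a contents-API URL and parse the JSON body
    with urlopen(url) as response:
        return json.load(response)


def collect_download_urls(contents_url, file_extension, fetch=_fetch_json):
    """Walk a repo's contents listing, descending into subdirectories,
    and return the download URLs of files with the desired extension."""
    urls = []
    for entry in fetch(contents_url):
        if entry["type"] == "dir":
            # Recurse via the directory entry's own contents-API URL
            urls.extend(collect_download_urls(entry["url"], file_extension, fetch))
        elif entry["type"] == "file" and entry["name"].endswith(file_extension):
            urls.append(entry["download_url"])
    return urls
```

Accepting `fetch` as a parameter keeps the traversal logic testable with canned responses, while the default still performs real HTTP requests.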
Issue Analytics
- Created 2 years ago
- Comments: 9 (4 by maintainers)
Since users can already specify public web resources as a source, non-bulk load scenarios are already somewhat supported. Therefore it might be better to defer until we have a better understanding of how (KFP/AA) users currently manage component specifications.
TBH, I'm not sure whether this would be a measurable improvement over having this information in `annotations`. The idea of `annotations` is to provide a pathway for extensibility while maintaining backward and forward compatibility. It can be used as an experimental playground while the tools are being tested. Note how `canonical_location` was added without changing the ComponentSpec schema, without breaking old or new users. Additionally, I'm not fully sure how the component_version would be handled in a world where a component file can be forked and changed. E.g. what happens when someone forks the component and makes some change, but does not update the version? What if they increase the version a lot? This is why I've used the `canonical_location` wording. Canonical location points to a repo and branch where the whole component lineage can be discovered and the latest version can be obtained. It also allows changing the "component ID" - the latest component yaml file will have `canonical_location` pointing to another directory/repo.