Proposal: Release artifact build and import process
Background
There are two models of building content: “push” and “pull”. In a “push” model, the user builds an artifact (e.g., software package, content archive, container image, etc.) locally, and pushes it to a content server. In a “pull” model, the content server downloads or pulls the source code, and builds the artifact for the user. In both models, there are defined procedures, formats, metadata, and supporting tooling to aid in producing a release artifact.
Most popular content services use a “push” model, including PyPI (Python packages), Crates.io (Rust packages), and NPM (Node.js packages). For these services, the content creator transforms the source code into a package artifact, and takes on the responsibility of testing, building, and pushing the artifact to the content server.
In rarer cases, a content service takes on the process of building artifacts. Docker Hub is one such example: a content creator can configure an automated build process that is triggered by a notification from a source code hosting service (e.g., GitHub or Bitbucket) when new code is merged. In response to the notification, Docker Hub downloads the new code and generates a new image.
Problem Description
The Galaxy import process works as a “pull” model that can be initiated manually via the Galaxy website, or triggered automatically via a webhook from the Travis CI platform. However, unlike other content services, Galaxy does not enforce an artifact format, does not provide a specification for artifact metadata, and does not provide tooling to aid in building release artifacts.
When it comes to versioning content, Galaxy relies on git tags stored in the source code hosting service (GitHub). These tags point to a specific commit within the source code history. Each tag represents a point in time within the source code lifecycle, and is only useful within the context of a git repository. Removing the source code from the repository and placing it in an artifact causes the git tags to be lost, and with them any notion of the content version.
Galaxy provides no concept of repository-level metadata, where information such as a version number, name, and namespace might be located and associated with a release artifact. Metadata is currently only defined at the content level. For example, Ansible roles contain metadata stored in a meta/main.yml file, and modules contain metadata within their source code. Combine multiple content items and types into a single release artifact, and the metadata becomes ambiguous.
The Galaxy import process does not look for a release artifact, but instead clones the GitHub repository, and inspects the local clone. This means that any notion of content version it discovers and records comes directly from git tags. It’s not able to detect when a previously recorded version of the content has been altered, nor is it able to help an end user verify that the content being downloaded is the expected content. It’s also not able to inspect and test release artifacts, and therefore can offer no assurances to the end user of the content.
As you might expect, since Galaxy doesn’t interact with release artifacts, it offers no prescribed process or procedures for creating a release archive, nor does it offer any tooling to assist in the creation of a release archive. The good news is that Galaxy is a blank canvas in this regard.
Proposed Solution
Define repository metadata and build manifest
A repository metadata file, galaxy.yml, will be placed at the root of the project directory tree, and contain information such as author, license, name, namespace, etc. It will hold any attributes required to create a release artifact from the repository source tree.
The archive build process (defined later) will package the repository source contents (e.g., roles, modules, plugins, etc.), and generate a build manifest file. The generated manifest file will contain the metadata found in galaxy.yml, plus information about the package structure and contents, and information about the release, including the version number.
The generated manifest file will be a JSON formatted file called METADATA that will be added to the root of the release artifact during the build process. Consumers of the release artifact, such as the Galaxy CLI, and the Galaxy import process, will be able to read the manifest file, and verify information about the release and its contents.
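To make this concrete, here is a rough sketch of what the manifest-generation step could look like. The galaxy.yml keys, the manifest field names, and the build_manifest helper are illustrative assumptions, not a finalized specification.

```python
import hashlib
import os

import yaml  # PyYAML, used here to read the repository metadata file


def build_manifest(repo_root, version):
    """Sketch: combine galaxy.yml metadata with per-file checksums.

    Field names below are illustrative; the real manifest format would be
    defined by the Galaxy specification.
    """
    with open(os.path.join(repo_root, "galaxy.yml")) as f:
        repo_meta = yaml.safe_load(f)

    files = []
    for dirpath, _, filenames in os.walk(repo_root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            with open(path, "rb") as fh:
                digest = hashlib.sha256(fh.read()).hexdigest()
            files.append({
                "name": os.path.relpath(path, repo_root),
                "chksum_sha256": digest,
            })

    return {
        "namespace": repo_meta.get("namespace"),
        "name": repo_meta.get("name"),
        "version": version,
        "license": repo_meta.get("license"),
        "files": files,
    }
```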
Enable Mazer to build packages
Given a defined package structure and a process for building a release artifact, it makes sense to build into Mazer the components that automate the artifact build process.
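A minimal sketch of the archive-building piece, assuming the manifest dictionary produced by the build_manifest sketch above; the archive naming convention shown here is an assumption, not part of the proposal.

```python
import json
import os
import tarfile


def build_artifact(repo_root, manifest, dest_dir="."):
    """Sketch: package the source tree plus the METADATA manifest into a .tar.gz."""
    # Write the manifest into the tree so it lands at the root of the artifact.
    with open(os.path.join(repo_root, "METADATA"), "w") as f:
        json.dump(manifest, f, indent=2)

    # Hypothetical naming convention: <namespace>-<name>-<version>.tar.gz
    name = f"{manifest['namespace']}-{manifest['name']}-{manifest['version']}.tar.gz"
    archive_path = os.path.join(dest_dir, name)
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(repo_root, arcname=".")
    return archive_path
```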
Use GitHub Releases as content storage
GitHub Releases will be the mechanism for storing and sharing release archives. GitHub provides an API that can be used by CI platforms and Mazer to push release artifacts to GitHub.
Mazer will be extended with the ability to push a release artifact to GitHub. This provides a single, consistent method for content creators to automate release pushes that can be called from any CI platform.
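As a rough illustration, a push step could use GitHub’s Releases API to create a release and attach the artifact as an asset. The repository details and token handling below are placeholders; only the two GitHub endpoints shown are real.

```python
import requests


def push_release(owner, repo, tag, artifact_path, token):
    """Sketch: create a GitHub release for an existing tag and attach the artifact."""
    headers = {"Authorization": f"token {token}"}

    # Create the release.
    resp = requests.post(
        f"https://api.github.com/repos/{owner}/{repo}/releases",
        json={"tag_name": tag, "name": tag},
        headers=headers,
    )
    resp.raise_for_status()
    upload_url = resp.json()["upload_url"].split("{")[0]  # strip the URI template suffix

    # Upload the release artifact as a release asset.
    with open(artifact_path, "rb") as f:
        asset = requests.post(
            upload_url,
            params={"name": os.path.basename(artifact_path)},
            data=f,
            headers={**headers, "Content-Type": "application/gzip"},
        )
    asset.raise_for_status()
    return asset.json()["browser_download_url"]


import os  # used above for os.path.basename
```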
Notify the Galaxy server when new release artifacts are available
On the Galaxy server, add the ability for users to generate an API token that can be used by clients, such as Mazer, to authenticate with the API.
Extend Mazer with the ability to trigger an import process. Mazer will authenticate with the API via a user’s API token, and trigger an import of the newly available release.
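Roughly, the client side might look like the following sketch. The endpoint path, payload fields, and token header format are assumptions made for illustration, not documented Galaxy API behavior.

```python
import requests

GALAXY_API = "https://galaxy.ansible.com/api"  # base URL; endpoint below is hypothetical


def trigger_import(github_user, github_repo, release_url, api_token):
    """Sketch: authenticate with a user's API token and request an import of a new release."""
    resp = requests.post(
        f"{GALAXY_API}/v1/imports/",       # hypothetical endpoint
        json={
            "github_user": github_user,
            "github_repo": github_repo,
            "release_url": release_url,    # hypothetical field
        },
        headers={"Authorization": f"Token {api_token}"},
    )
    resp.raise_for_status()
    return resp.json()
```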
Verify release artifacts
Enable Mazer to verify the integrity of release artifacts downloaded from GitHub at the time of installation.
There are several solutions widely used for verifying the integrity of a downloaded artifact, including checksums and digital signatures. In general, a checksum guarantees integrity, but not authenticity. A digital signature guarantees both integrity and authenticity.
Using a digital signature for user content requires a complex process of maintaining a trusted keychain, and still does not guarantee perfect authenticity. Since release artifacts are not hosted by Galaxy, but rather by a third party, it’s impossible to perfectly guarantee authenticity.
However, since Galaxy is a centralized package index, and data transfer between the Galaxy server and client is secured via TLS encryption, Galaxy can be considered a trusted source of metadata, and integrity verification can be achieved by storing release artifact checksums on the Galaxy server.
During import of a repository, Galaxy will store metadata, including the checksum, for a specific content version only once. Any subsequent updates to a version will be prohibited.
Import Workflow
- Using Mazer, the user triggers an import of a repository, passing the URL of the new release
- Galaxy downloads the release artifact, calculates a checksum, and stores the checksum along with additional metadata about the release
- Any subsequent update of an already imported version is prohibited (a minimal sketch of this rule follows the list)
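The sketch below shows the server-side rule in miniature, using an in-memory store in place of Galaxy’s real database; the class and its shape are illustrative only.

```python
import hashlib


class ReleaseRegistry:
    """Sketch: record a checksum per content version exactly once."""

    def __init__(self):
        self._checksums = {}  # (namespace, name, version) -> sha256 hex digest

    def record(self, namespace, name, version, artifact_bytes):
        key = (namespace, name, version)
        if key in self._checksums:
            raise ValueError(
                f"{namespace}.{name} {version} already imported; "
                "updates to an existing version are prohibited"
            )
        self._checksums[key] = hashlib.sha256(artifact_bytes).hexdigest()
        return self._checksums[key]
```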
Install Workflow
- The user executes the mazer install command to install an Ansible collection
- Mazer downloads package metadata from Galaxy, which includes the download URL and checksum
- Mazer downloads the release artifact
- Mazer calculates the checksum of the downloaded package and compares it with the checksum received from Galaxy, as sketched below
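The verification step in the last bullet amounts to a few lines of hashing. A minimal sketch, assuming the expected checksum has already been fetched from Galaxy:

```python
import hashlib


def verify_artifact(path, expected_sha256):
    """Sketch: compare a downloaded artifact's checksum with Galaxy's recorded value."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    if digest.hexdigest() != expected_sha256:
        raise RuntimeError(f"checksum mismatch for {path}; refusing to install")
```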
I have some concerns regarding the overall direction of this discussion.
I mostly agree with @cutwater, but anyway I want to add my two cents.
Traditional way to organise repos
At the moment, the normal approach is to organize repositories in the simplest, most reliable way. For example, PyPI, NPM, RPM, and Maven (and many others) essentially decompose packages into folders on a file system and generate metadata for all packages, where a package is just an archive, such as a tarball. This approach is reliable, causes no problems, and has worked for years. Attempts to do something more optimal or clever lead to problems like those DEB repos have, where the repo is inconsistent during updates. Plus, all of these repo engines usually have a web UI with a search engine.
Here is the direction this discussion is heading.
You propose to build a repository on top of a distributed, virtual file storage system, where you are responsible for neither consistency nor data accessibility. You have limited ACLs for this storage, and you may lose access at any time. You will eventually improve and evolve this system, and in time you’ll find this storage contains lots of packages in outdated formats, stored across many different storage backends: GitHub, GitLab, Nexus, custom web servers, Amazon, Google… This future doesn’t look very appealing. It’s much more complicated than DEB repos. Something will definitely go wrong.
You also need to consider these questions:
My proposal
Wrap all these things with Python setuptools and distribute them as Python packages. Use PyPI, your own repo, or both. Don’t reinvent the wheel. So many people will be thankful for this simple solution!
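For illustration of what is being suggested here, packaging a role as a Python distribution might look roughly like this; the project name and file paths are hypothetical, and the role content ships as data files rather than importable Python modules.

```python
# setup.py -- hypothetical packaging of an Ansible role as a Python distribution
from setuptools import setup

setup(
    name="ansible-role-example",       # hypothetical project name
    version="1.2.3",
    description="An Ansible role distributed as a Python package",
    packages=[],                       # the role itself is data, not Python code
    data_files=[
        ("share/ansible/roles/example/tasks", ["tasks/main.yml"]),
        ("share/ansible/roles/example/meta", ["meta/main.yml"]),
    ],
)
```

A user could then install the result with pip from PyPI or an internal index, which is the “don’t reinvent the wheel” point being made.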
Following up on the discussion with @daviddavis, @bmbouter, @alikins and @cutwater…
We decided the following:
- There will be no push github command. There’s no need for Mazer to push directly to GitHub.
- Mazer will have a publish command to publish to Galaxy. Galaxy will publish the artifact to GitHub, and possibly, in the future, store a copy of the archive in Pulp or a similar service.
- The publish command will perform a multi-part upload of the file to a Galaxy API endpoint.

Just to be clear, we’re not forcing contributors to use this process day one. Galaxy will continue to support the existing import process that relies only on GitHub repositories. This new process will be optional. Consider it the first phase in moving Galaxy toward hosting content.
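To make the agreed publish flow concrete, here is a rough sketch of the client side, interpreting “multi-part upload” as a multipart/form-data POST (the real endpoint may instead use chunked uploads). The URL and token header format are assumptions for illustration only.

```python
import os

import requests


def publish(artifact_path, api_token,
            url="https://galaxy.ansible.com/api/v2/collections/"):  # hypothetical endpoint
    """Sketch: upload a release artifact to Galaxy as a multipart POST."""
    with open(artifact_path, "rb") as f:
        resp = requests.post(
            url,
            files={"file": (os.path.basename(artifact_path), f)},
            headers={"Authorization": f"Token {api_token}"},
        )
    resp.raise_for_status()
    return resp.json()
```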