Enable batch execution of workflows
One of the goals of renku workflows is to allow a user to develop a robust, working pipeline by iterating quickly on a dataset locally (laptop, interactive session) and then to send that workflow to a more capable resource to run on a bigger dataset or with parameters requiring extra compute power.
A significant goal of the workflow KG representation was to allow serialization of workflow information into other formats. At the moment only the Common Workflow Language (CWL) is supported, but the same methods used to create CWL files can be extended to other workflow languages. One limitation of CWL is that there does not seem to be good support for running these workflows on either Kubernetes or HPC systems.
The goal of this epic is to serve as a roadmap for the implementation of a) the supporting DevOps infrastructure and b) the required code changes for a simple proof of concept (PoC) of batch/remote workflow execution.
General use-case
A simple use-case might look something like this:
# develop the workflow
renku run <step1> <dataset>
renku run <step2> <dataset>
# update the dataset
renku dataset update --all
# run update remotely to avoid expensive local calculation
renku update --remote=cloud --all
# or use rerun to specify different parameters
renku rerun --edit-inputs --remote=cloud
The last two steps are identical to what the user can do now, except that they would run in the Kubernetes cluster. The steps should be sent to the workflow engine as a DAG expressed in whatever workflow language the backend requires. Some steps might run in parallel. Once all the steps have completed, the workflow should push the changes back to the repository, just like a user would do if running those commands locally.
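From the user's perspective, the local end of that round trip could then be as simple as the following sketch (assuming the remote run commits and pushes its results, as described above):

```shell
# after the remote execution finishes, fetch the commits it pushed back
git pull
# the project should now report no outdated outputs
renku status
```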
An analogous flow might be envisioned from the web UI, where the page showing the overview of the project’s assets might inform the user that some workflow outputs are out of date and give the option to update them automatically.
Issues to consider
There are several issues to consider (in no particular order of importance):
- serialization to a different workflow format/language/syntax
- minimize data I/O - ideally, the data required for the calculations would be pulled once and shared between the steps - on Kubernetes this poses a potential difficulty for parallel steps because of issues around multi-attach volumes
- ability to run steps in parallel and combine automatically before a dependent step
- UX of configuring remote resources and setting defaults (see the configuration sketch after this list) - for example, a default could be to run on the same cluster as the RenkuLab instance, but a user may choose to specify a custom resource using a standard interface (e.g. an HPC cluster via SSH and Slurm or LSF)
- providing some feedback about the status of the remote execution, especially access to error logs
- the remote workflow should run in a Docker container that includes all of the software dependencies specified in the project - we should consider making “batch” Docker images that don’t include the entire Jupyter stack to minimize the time it takes to launch containers
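For the configuration UX, one option is to reuse the existing renku config command; the remote.* keys below are purely hypothetical and only illustrate the idea of per-backend defaults:

```shell
# hypothetical configuration keys, not an implemented interface
renku config set remote.default cloud                        # run on the RenkuLab cluster by default
renku config set remote.cloud.backend argo                   # Kubernetes-native backend
renku config set remote.hpc.backend slurm                    # user-provided HPC resource
renku config set remote.hpc.host login.cluster.example.org   # reached via SSH
```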
Building a PoC
The result of this epic should be a general architecture for running remote workflows and a PoC that implements the architecture for some subset of the above functionality using a specific workflow backend. One obvious choice for Kubernetes is Argo Workflows; other potential options should also be evaluated.
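As a rough sketch, once the renku DAG has been serialized to an Argo Workflow manifest, the PoC's interaction with such a backend could look roughly like this (manifest name and namespace are placeholders):

```shell
# hypothetical: submit the serialized DAG and follow its progress
argo submit --namespace renku-batch --watch renku-update-workflow.yaml
# surface the logs of the most recent run, e.g. to report errors back to the user
argo logs @latest --namespace renku-batch
```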
Comments
Thanks @Panaetius - you’re absolutely right, I forgot to add nextflow to the list - will do so now.
Regarding the command semantics - yes, you're right, I was being a bit too myopic. We definitely need to support a different kind of command here - workflow execute? Here we could allow for seamlessly using workflow templates from other projects (or even other instances).
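For instance, one hypothetical invocation (the flags and project URL below are illustrative assumptions, not a settled design) could be:

```shell
# hypothetical syntax: run a workflow template from another project on a remote backend
renku workflow execute --remote=cloud \
    --from https://renkulab.io/projects/other-namespace/other-project \
    my-workflow --set input_csv=data/big_dataset.csv
```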
Running it without a parameter list would prompt you for whatever inputs need to be specified. This starts to bleed a bit into https://github.com/SwissDataScienceCenter/renku-python/issues/1553 and probably other open issues.
re: async or sync: Ideally it would be possible to do this asynchronously, with the special case where you want to wait for completion. Since it's to be used from the UI, async needs to be supported, but maybe starting with sync mode would be sufficient for the PoC.
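On the CLI this could surface as something like the following (hypothetical flags, only meant to illustrate the async/sync split):

```shell
# hypothetical: submit and return immediately with a job identifier
renku update --remote=cloud --all --no-wait
# hypothetical: block until the remote run finishes (probably enough for the PoC)
renku update --remote=cloud --all --wait
```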
RenkuLab use case for workflow execution on HPC: iSEE Dashboard Data
Context
This repository contains 3 workflows to fetch single-cell omics data from external sources.
Each workflow consists of a single task executing Rscript with an Rmd file input (folder processing_scripts) that produces a processed data file in rds format and optionally a configuration file in R format. A dataset is then created with the outputs of each workflow. The commands for these steps are listed in create_datasets_code.sh.
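As a sketch, one such step might be recorded roughly like this (file names and the exact Rscript invocation are placeholders inferred from the description, not the actual repository contents):

```shell
# hypothetical: record a single Rscript task with an Rmd input and an rds output
renku run --input processing_scripts/fetch_GSE62944.Rmd --output data/GSE62944.rds -- \
    Rscript -e "rmarkdown::render('processing_scripts/fetch_GSE62944.Rmd')"
```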
Problem
While the first 2 workflows run smoothly and produce the desired output, the 3rd workflow, which fetches the bigger single-cell tumor dataset GSE62944 from ExperimentHub, runs out of the resources available on the RenkuLab instance. Typical single-cell omics datasets are of this size or bigger (including the benchmarking datasets planned for the OMNIBENCHMARK system). There is a need to execute the workflows that fetch large omics data on more powerful HPC compute resources and to bring back the resulting processed files and corresponding metadata.
Desired solution
Execute the workflow created with the command below on remote compute resources. Expected output: