Enable batch execution of workflows
One of the goals of renku workflows is to allow a user to develop a robust, working pipeline by iterating quickly on a dataset locally (laptop, interactive session) and then to send that workflow to a more capable resource to run on a bigger dataset or with parameters requiring extra compute power.
A significant goal of the workflow KG representation was to allow serialization of workflow information into other formats. At the moment only the Common Workflow Language (CWL) is supported, but the same methods used to create CWL files can be extended to other workflow languages. One limitation of CWL is that there does not seem to be good support for running these workflows on either Kubernetes or HPC systems.
The goal of this epic is to serve as a roadmap for the implementation of a) the supporting DevOps infrastructure and b) the required code changes for a simple proof of concept (PoC) of batch/remote workflow execution.
General use-case
A simple use-case might look something like this:
# develop the workflow
renku run <step1> <dataset>
renku run <step2> <dataset>
# update the dataset
renku dataset update --all
# run update remotely to avoid expensive local calculation
renku update --remote=cloud --all
# or use rerun to specify different parameters
renku rerun --edit-inputs --remote=cloud
The last two steps are identical to what the user can do now, except that they would run in the Kubernetes cluster. The steps should be sent to the workflow engine as a DAG expressed in whatever workflow language the backend requires. Some steps might run in parallel. Once all the steps have completed, the workflow should push the changes back to the repository, just like a user would do if running those commands locally.
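From the user's perspective, the local end of that round trip could then be as simple as the following sketch (assuming the remote run commits and pushes its results, as described above):

```shell
# after the remote execution finishes, fetch the commits it pushed back
git pull
# the project should now report no outdated outputs
renku status
```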
An analogous flow might be envisioned from the web UI, where the page showing the overview of the project’s assets might inform the user that some workflow outputs are out of date and give the option to update them automatically.
Issues to consider
There are several issues to consider (in no particular order of importance):
- serialization to a different workflow format/language/syntax
- minimize data I/O - ideally, the data required for the calculations would be pulled once and shared between the steps - on Kubernetes this poses a potential difficulty for parallel steps because of issues around multi-attach volumes
- ability to run steps in parallel and combine automatically before a dependent step
- UX of configuring remote resources and setting defaults (see the configuration sketch after this list) - for example, a default could be to run on the same cluster as the RenkuLab instance, but a user may choose to specify a custom resource using a standard interface (e.g. an HPC cluster via SSH and Slurm or LSF)
- providing some feedback about the status of the remote execution, especially access to error logs
- the remote workflow should run in a Docker container that includes all of the software dependencies specified in the project - we should consider making “batch” Docker images that don’t include the entire Jupyter stack to minimize the time it takes to launch containers
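For the configuration UX, one option is to reuse the existing renku config command; the remote.* keys below are purely hypothetical and only illustrate the idea of per-backend defaults:

```shell
# hypothetical configuration keys, not an implemented interface
renku config set remote.default cloud                        # run on the RenkuLab cluster by default
renku config set remote.cloud.backend argo                   # Kubernetes-native backend
renku config set remote.hpc.backend slurm                    # user-provided HPC resource
renku config set remote.hpc.host login.cluster.example.org   # reached via SSH
```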
Building a PoC
The result of this epic should be a general architecture for running remote workflows and a PoC that implements the architecture for some subset of the above functionality using a specific workflow backend. One obvious choice for Kubernetes is Argo Workflows; other potential options should also be evaluated.
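As a rough sketch, once the renku DAG has been serialized to an Argo Workflow manifest, the PoC's interaction with such a backend could look roughly like this (manifest name and namespace are placeholders):

```shell
# hypothetical: submit the serialized DAG and follow its progress
argo submit --namespace renku-batch --watch renku-update-workflow.yaml
# surface the logs of the most recent run, e.g. to report errors back to the user
argo logs @latest --namespace renku-batch
```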
Comments
Thanks @Panaetius - you’re absolutely right, I forgot to add nextflow to the list - will do so now.
Regarding the command semantics - yes, you're right, I was being a bit too myopic. We definitely need to support a different kind of command here - workflow execute? Here we could allow for seamlessly using workflow templates from other projects (or even other instances).
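For instance, one hypothetical invocation (the flags and project URL below are illustrative assumptions, not a settled design) could be:

```shell
# hypothetical syntax: run a workflow template from another project on a remote backend
renku workflow execute --remote=cloud \
    --from https://renkulab.io/projects/other-namespace/other-project \
    my-workflow --set input_csv=data/big_dataset.csv
```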
Running it without a parameter list would prompt you for whatever inputs need to be specified. This starts to bleed a bit into https://github.com/SwissDataScienceCenter/renku-python/issues/1553 and probably other open issues.
re: async or sync: Ideally it would be possible to do this asynchronously, with the special case where you want to wait for completion. Since it's to be used from the UI, async needs to be supported, but maybe starting with sync mode would be sufficient for the PoC.
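On the CLI this could surface as something like the following (hypothetical flags, only meant to illustrate the async/sync split):

```shell
# hypothetical: submit and return immediately with a job identifier
renku update --remote=cloud --all --no-wait
# hypothetical: block until the remote run finishes (probably enough for the PoC)
renku update --remote=cloud --all --wait
```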
RenkuLab use case for workflow execution on HPC: iSEE Dashboard Data
Context
This repository contains 3 workflows to fetch single-cell omics data from external sources.
Each workflow consists of a single task executing Rscript with an Rmd file input (folder processing_scripts) that produces a processed data file in rds format and optionally a configuration file in R format. A dataset is then created with the outputs of each workflow. The commands for these steps are listed in create_datasets_code.sh.
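As a sketch, one such step might be recorded roughly like this (file names and the exact Rscript invocation are placeholders inferred from the description, not the actual repository contents):

```shell
# hypothetical: record a single Rscript task with an Rmd input and an rds output
renku run --input processing_scripts/fetch_GSE62944.Rmd --output data/GSE62944.rds -- \
    Rscript -e "rmarkdown::render('processing_scripts/fetch_GSE62944.Rmd')"
```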
Problem
While the first 2 workflows run smoothly and produce the desired output, the 3rd workflow, which fetches the bigger single-cell tumor dataset GSE62944 from ExperimentHub, runs out of the resources available on the RenkuLab instance. Typical single-cell omics datasets are of this size or bigger (including the benchmarking datasets planned for the OMNIBENCHMARK system). There is a need to execute the workflows that fetch large omics data on more powerful HPC compute resources and to bring back the resulting processed files and corresponding metadata.
Desired solution
Execute the workflow created with the command below on remote compute resources. Expected output: