Proposal: WDL resources block
This came from discussions at the OpenBio Winter Codefest about how to let people manage computational resources, specifically driven by Spark (or Grid Engine, or whatever else).
When you use an external resource like Dataproc to run Spark jobs on clusters, you need to manage those clusters during the lifetime of your workflow. In some cases you might want one cluster per task; in others you might want to reuse the same cluster across multiple tasks. To support this, the proposal includes a before & after mechanism to help with that management.
The new keywords below are `before`, `after`, and `resources`.

`before` is a callable that is called before any of the `call`s in a workflow. `after` is a callable that is called after all the `call`s in a workflow are complete, and is guaranteed to be called as long as `before` succeeds, regardless of whether `continueWhilePossible` is used. `resources` is a block that contains exactly one `before`, one `after`, and one or more `call`s inside of it. `resources` is added so that one workflow can have more than one set of `before` and `after`, e.g. to run tasks in parallel on different Spark clusters. `before` and `after` are used in the same way `call` is used.
workflow my_variant_caller {
  String cluster_size
  String bam

  resources {
    before before_task{input: cluster_size=cluster_size}
    call my_task1{input: cluster_name=before_task.cluster_name, bam=bam}
    after after_task{input: cluster_name=before_task.cluster_name}
  }

  output {
    String job_output = my_task1.spark_output
  }
}
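To illustrate the multiple-`resources` case mentioned above, here is a minimal sketch of one workflow managing two independent Spark clusters in parallel. The task names and inputs are hypothetical, invented for illustration:

workflow two_cluster_example {
  String bam

  # each resources block manages the lifecycle of its own cluster
  resources small_cluster {
    before start_small{input: cluster_size="10"}
    call qc_task{input: cluster_name=start_small.cluster_name, bam=bam}
    after stop_small{input: cluster_name=start_small.cluster_name}
  }

  resources large_cluster {
    before start_large{input: cluster_size="100"}
    call variant_task{input: cluster_name=start_large.cluster_name, bam=bam}
    after stop_large{input: cluster_name=start_large.cluster_name}
  }
}

Because each `after` is guaranteed to run once its `before` succeeds, neither cluster should be leaked even if one of the inner `call`s fails.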
An example I tried out, using this proposal together with https://github.com/openwdl/wdl/issues/182, to port the Hail WDL task I have been working on (https://github.com/broadinstitute/firecloud-tools/blob/ab_hail_wdl/scripts/hail_wdl_test/hail_test_cleanup.wdl) to the proposed syntax:
############################
Template file contents:
task start_cluster {
  # TODO: a struct later would be much easier
  Map[String, String] dataproc_cluster_specs
  command { ... spin up cluster here ... }
  output {
    String cluster_name = "name of cluster made"
    String cluster_staging_bucket = "path to cluster's staging bucket"
  }
}
task delete_cluster {
  String cluster_name
  command { ... delete cluster here ... }
  output {}
}
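For context, the elided commands in the two tasks above might look roughly like the following against the gcloud CLI. This is an illustrative sketch only: the region, the flags chosen, and the hard-coded cluster name and staging bucket are assumptions, not part of the proposal.

task start_cluster {
  Map[String, String] dataproc_cluster_specs
  command {
    # hypothetical: create a Dataproc cluster from the supplied specs
    gcloud dataproc clusters create hail-test-cluster \
      --region us-central1 \
      --master-machine-type ${dataproc_cluster_specs["master_machine_type"]} \
      --master-boot-disk-size ${dataproc_cluster_specs["master_machine_disk"]}
  }
  output {
    String cluster_name = "hail-test-cluster"
    # hypothetical staging bucket attached to the cluster
    String cluster_staging_bucket = "gs://hail-test-cluster-staging"
  }
}

task delete_cluster {
  String cluster_name
  command {
    # hypothetical: tear the cluster down unconditionally
    gcloud dataproc clusters delete ${cluster_name} --region us-central1 --quiet
  }
  output {}
}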
# template workflow for running pyspark on dataproc
template workflow pyspark_dataproc {
  # TODO: a struct later would be much easier
  Map[String, String] dataproc_cluster_specs
  String cluster_name

  resources dataproc_cluster_manager {
    before start_cluster{input: dataproc_cluster_specs=dataproc_cluster_specs}
    call submit_job{input: cluster_name=cluster_name, cluster_staging_bucket=start_cluster.cluster_staging_bucket}
    after delete_cluster{input: cluster_name=cluster_name}
  }
}
############################
# User workflow file contents:
import "pyspark_dataproc.wdl" as pyspark_dataproc_wdl
# user defined task
task submit_job {
  String cluster_name
  String cluster_staging_bucket
  File hailCommandFile
  String inputVds
  String inputAnnot
  File outputVdsFileName
  File qcResultsFileName
  command { ... submit to the cluster and output to cluster staging bucket ... }
}
# workflow that uses template
workflow submit_hail_job {
  String cluster_name
  String cluster_staging_bucket
  File hailCommandFile
  String inputVds
  String inputAnnot
  File outputVdsFileName
  File qcResultsFileName

  call pyspark_dataproc_wdl.pyspark_dataproc {
    input: dataproc_cluster_specs = {"master_machine_type": "n1-standard-8", "master_machine_disk": "100"},
    inject: submit_job = submit_job(input: cluster_name=cluster_name, cluster_staging_bucket=cluster_staging_bucket,
                                    hailCommandFile=hailCommandFile, inputVds=inputVds, inputAnnot=inputAnnot,
                                    outputVdsFileName=outputVdsFileName, qcResultsFileName=qcResultsFileName)
  }

  output {
    String out1 = pyspark_dataproc.out1
  }
}
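For completeness, a Cromwell-style inputs file for submit_hail_job might look like this (all values are placeholders, not taken from the original issue):

{
  "submit_hail_job.cluster_name": "hail-test-cluster",
  "submit_hail_job.cluster_staging_bucket": "gs://hail-test-staging",
  "submit_hail_job.hailCommandFile": "gs://my-bucket/hail_script.py",
  "submit_hail_job.inputVds": "gs://my-bucket/input.vds",
  "submit_hail_job.inputAnnot": "gs://my-bucket/annotations.tsv",
  "submit_hail_job.outputVdsFileName": "output.vds",
  "submit_hail_job.qcResultsFileName": "qc_results.txt"
}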
Top GitHub Comments
I don't think we specifically landed anywhere. My impression is that this goes against the goal of abstracting WDL from the execution environment. It's possible this logic could be put into the `hints` section, but I would not be particularly keen on voting this forward myself.

Right. Closing this issue with the recommendation that if someone cares very, very strongly they can open a new proposal working out how this would work as a hints-based thing.
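If someone does pick this up, a hints-based version might look roughly like the sketch below. The hint keys here are hypothetical and engine-specific; nothing like them is defined in the WDL spec, and an engine would have to understand them to manage the cluster lifecycle around the task:

task submit_job {
  String cluster_name
  File hailCommandFile
  command { ... submit to the cluster ... }
  hints {
    # hypothetical engine-specific keys; an engine that understands them
    # could create, reuse, and tear down the cluster around this task
    dataproc_master_machine_type: "n1-standard-8"
    dataproc_master_disk_gb: 100
    dataproc_reuse_cluster: true
  }
}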