Proposal: WDL templates
This came from discussions at OpenBio Winter Codefest around how to allow people to manage external resources, specifically driven by Spark.
Ideally, you want to write generic code once that performs certain steps, e.g. manages Spark clusters (spin up, submit job(s), tear down), and then reuse it. Subworkflows already support this by letting you reuse workflows, but we also need the ability to inject custom tasks into the generic workflow template in order to customize it. Since the template concept would be WDL-specific, it would work for any externally managed resource: Cromwell does not need to be aware of Spark or any other similar resource; they all become user-definable.
Additionally, templates have uses beyond managed resources: e.g. a variant calling template that validates a BAM, runs some form of variant calling, and then produces metrics. Someone could import the template and inject their custom variant calling workflow or task.
The new keywords below are template, contract, and inject. template allows you to define a workflow into which one or more contract-satisfying tasks or workflows can be injected. contract specifies the contract (a relatively generic bioinformatics analysis action, such as "alignment") that defines inputs and outputs, but no other contents. inject tells the template which implementation workflow or task to inject in order to satisfy the contract.
template.wdl:
template workflow variant_calling_template {
  String in1

  contract user_defined_variant_caller {
    String input1
    output {
      String output1
    }
  }

  # this template defines the workflow that is being run, in this case call
  # task1, then task2, then the inner workflow, and finally task3
  call task1 { input: in1 = in1 }
  call task2 { input: in2 = task1.out1 }
  call user_defined_variant_caller { input: input1 = task2.out2 }
  call task3 { input: in3 = user_defined_variant_caller.output1 }

  output {
    String out1 = task3.out1
  }
}
Making use of a template in another workflow:
my_variant_caller.wdl:
import "template.wdl" as variant_calling_template_wdl
task my_variant_caller {
String input1
String input2
command { ... }
output { ... }
}
workflow my_variant_caller {
String in1
String in2
call variant_calling_template_wdl.variant_calling_template {
input: in1 = in1,
# NOTE: in trying to mock this out for a real example, I realized that you will
# often want to pass extra params to your task that the contract does
# not specify.
# I think the inject could work two ways:
# * only inject what the contact specifies:
# inject: user_defined_variant_caller = my_variant_caller
# * pass extra arguments beyond the contract as shown below (the contract
# arguments also fulfilled):
# inject: user_defined_variant_caller = my_variant_caller(input: input2=in2)
inject: user_defined_variant_caller = my_variant_caller(input: input2=in2)
}
output {
String out1 = variant_calling_template.out1
}
}
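For comparison, here is a minimal sketch of the first injection style (contract-only), assuming a hypothetical my_simple_caller task that needs nothing beyond what the contract specifies:

import "template.wdl" as variant_calling_template_wdl

# hypothetical task whose inputs and outputs exactly match the contract
task my_simple_caller {
  String input1
  command { ... }
  output {
    String output1 = "..."
  }
}

workflow my_simple_workflow {
  String in1

  call variant_calling_template_wdl.variant_calling_template {
    input: in1 = in1,
    # contract-only form: nothing beyond the contract is passed at inject time
    inject: user_defined_variant_caller = my_simple_caller
  }

  output {
    String out1 = variant_calling_template.out1
  }
}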
Below is an example that combines this proposal with https://github.com/openwdl/wdl/issues/183 to port the Hail WDL task I've been working on (https://github.com/broadinstitute/firecloud-tools/blob/ab_hail_wdl/scripts/hail_wdl_test/hail_test_cleanup.wdl) to the proposed syntax.
############################
# Template file contents:

task start_cluster {
  # TODO: a struct later would be much easier
  Map[String, String] dataproc_cluster_specs
  command { ... spin up cluster here ... }
  output {
    String cluster_name = "name of cluster made"
    String cluster_staging_bucket = "path to cluster's staging bucket"
  }
}

task delete_cluster {
  String cluster_name
  command { ... delete cluster here ... }
  output {}
}

# template workflow for running pyspark on dataproc
template workflow pyspark_dataproc {
  # TODO: a struct later would be much easier
  Map[String, String] dataproc_cluster_specs
  String cluster_name

  resources dataproc_cluster_manager {
    before start_cluster { input: dataproc_cluster_specs = dataproc_cluster_specs }
    call submit_job { input: cluster_name = cluster_name, cluster_staging_bucket = start_cluster.cluster_staging_bucket }
    after delete_cluster { input: cluster_name = cluster_name }
  }
}
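Note that, unlike the first example, this template does not show a contract for submit_job; presumably it would declare one inside the template workflow, roughly along the lines of this hypothetical sketch:

contract submit_job {
  String cluster_name
  String cluster_staging_bucket
  output {}
}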
############################
# User workflow file contents:

import "pyspark_dataproc.wdl" as pyspark_dataproc_wdl

# user defined task
task submit_job {
  String cluster_name
  String cluster_staging_bucket
  File hailCommandFile
  String inputVds
  String inputAnnot
  File outputVdsFileName
  File qcResultsFileName
  command { ... submit to the cluster and output to cluster staging bucket ... }
}

# workflow that uses template
workflow submit_hail_job {
  String cluster_name
  String cluster_staging_bucket
  File hailCommandFile
  String inputVds
  String inputAnnot
  File outputVdsFileName
  File qcResultsFileName

  call pyspark_dataproc_wdl.pyspark_dataproc {
    input: dataproc_cluster_specs = {"master_machine_type": "n1-standard-8", "master_machine_disk": "100"},
    inject: submit_job = submit_job(input: cluster_name = cluster_name, cluster_staging_bucket = cluster_staging_bucket,
                                    hailCommandFile = hailCommandFile, inputVds = inputVds, inputAnnot = inputAnnot,
                                    outputVdsFileName = outputVdsFileName, qcResultsFileName = qcResultsFileName)
  }

  output {
    # the template would need to surface an output for this to resolve
    String out1 = pyspark_dataproc.out1
  }
}
Top GitHub Comments
I totally agree about the other topics being more important; this was more to get discussion started.
I don’t quite understand the reproducibility or injection-at-runtime points though, so maybe I’m missing something there. The way I was picturing this is not very different from imports today. Imports already affect readability in that you need to look at all files to see what’s actually happening. In the case I outlined the import is effectively flipped: the template ends up importing the injected workflow or task. In the end you could flatten both into a single file and get the same results. Templates as I defined them are not strictly needed; without them you just have to add more explicit boilerplate to your workflows. You can always do a lot of copy-paste and get the same results.
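For illustration, a rough sketch (reusing the placeholder tasks from the first example above) of what that flattened, single-file version might look like:

workflow variant_calling_flattened {
  String in1
  String in2

  call task1 { input: in1 = in1 }
  call task2 { input: in2 = task1.out1 }
  # the previously injected task is simply called directly
  call my_variant_caller { input: input1 = task2.out2, input2 = in2 }
  call task3 { input: in3 = my_variant_caller.output1 }

  output {
    String out1 = task3.out1
  }
}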
Perhaps there is a different way to express this that looks less polymorphic, because that was not the original intention. Right now, if you import a task and use it, you satisfy the “contract” of that task or workflow by explicitly matching all of its inputs and outputs; the contract as I specified it was intended for the same purpose, by showing what the task or workflow would need to look like in order to be effectively imported into the template.
@cjllanwarne updated with the expanded argument syntax