Proposal: WDL resources block
This came from discussions at the OpenBio Winter Codefest about how to let people manage computational resources, specifically driven by Spark (or Grid Engine, or whatever else).
When you use an external resource like Dataproc to run Spark jobs on clusters, you need to manage those clusters during the lifetime of your workflow. In some cases you might want one cluster per task; in others you might want to reuse the same cluster across multiple tasks. To support this, the proposal includes a before & after mechanism to help with that management.
The new keywords below are `before`, `after`, and `resources`.

`before` is a callable that is called before any of the `call`s in a workflow. `after` is a callable that is called after all the `call`s in a workflow are complete, and is guaranteed to be called as long as `before` succeeds, regardless of whether `continueWhilePossible` is used. `resources` is a block that contains exactly one `before`, one `after`, and one or more `call`s inside of it. `resources` is added so that one workflow can have more than one set of `before` and `after`, e.g. to run tasks in parallel on different Spark clusters. `before` and `after` are used in the same way `call` is used.
workflow my_variant_caller {
  String cluster_size
  String bam

  resources {
    before before_task{input: cluster_size=cluster_size}
    call my_task1{input: cluster_name=before_task.cluster_name, bam=bam}
    after after_task{input: cluster_name=before_task.cluster_name}
  }

  output {
    String job_output = my_task1.spark_output
  }
}
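To illustrate the multiple-`resources` case mentioned above, here is a minimal sketch of one workflow managing two independent Spark clusters in parallel. The task names and inputs are hypothetical, invented for illustration:

workflow two_cluster_example {
  String bam

  # each resources block manages the lifecycle of its own cluster
  resources small_cluster {
    before start_small{input: cluster_size="10"}
    call qc_task{input: cluster_name=start_small.cluster_name, bam=bam}
    after stop_small{input: cluster_name=start_small.cluster_name}
  }

  resources large_cluster {
    before start_large{input: cluster_size="100"}
    call variant_task{input: cluster_name=start_large.cluster_name, bam=bam}
    after stop_large{input: cluster_name=start_large.cluster_name}
  }
}

Because each `after` is guaranteed to run once its `before` succeeds, neither cluster should be leaked even if one of the inner `call`s fails.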
An example I tried out, using this proposal together with https://github.com/openwdl/wdl/issues/182, to port the Hail WDL task I have been working on (https://github.com/broadinstitute/firecloud-tools/blob/ab_hail_wdl/scripts/hail_wdl_test/hail_test_cleanup.wdl) to the proposed syntax:
############################
Template file contents:
task start_cluster {
  # TODO: a struct later would be much easier
  Map[String, String] dataproc_cluster_specs
  command { ... spin up cluster here ... }
  output {
    String cluster_name = "name of cluster made"
    String cluster_staging_bucket = "path to cluster's staging bucket"
  }
}
task delete_cluster {
  String cluster_name
  command { ... delete cluster here ... }
  output {}
}
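For context, the elided commands in the two tasks above might look roughly like the following against the gcloud CLI. This is an illustrative sketch only: the region, the flags chosen, and the hard-coded cluster name and staging bucket are assumptions, not part of the proposal.

task start_cluster {
  Map[String, String] dataproc_cluster_specs
  command {
    # hypothetical: create a Dataproc cluster from the supplied specs
    gcloud dataproc clusters create hail-test-cluster \
      --region us-central1 \
      --master-machine-type ${dataproc_cluster_specs["master_machine_type"]} \
      --master-boot-disk-size ${dataproc_cluster_specs["master_machine_disk"]}
  }
  output {
    String cluster_name = "hail-test-cluster"
    # hypothetical staging bucket attached to the cluster
    String cluster_staging_bucket = "gs://hail-test-cluster-staging"
  }
}

task delete_cluster {
  String cluster_name
  command {
    # hypothetical: tear the cluster down unconditionally
    gcloud dataproc clusters delete ${cluster_name} --region us-central1 --quiet
  }
  output {}
}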
# template workflow for running pyspark on dataproc
template workflow pyspark_dataproc {
  # TODO: a struct later would be much easier
  Map[String, String] dataproc_cluster_specs
  String cluster_name

  resources dataproc_cluster_manager {
    before start_cluster{input: dataproc_cluster_specs=dataproc_cluster_specs}
    call submit_job{input: cluster_name=cluster_name, cluster_staging_bucket=start_cluster.cluster_staging_bucket}
    after delete_cluster{input: cluster_name=cluster_name}
  }
}
############################
# User workflow file contents:
import "pyspark_dataproc.wdl" as pyspark_dataproc_wdl
# user defined task
task submit_job {
  String cluster_name
  String cluster_staging_bucket
  File hailCommandFile
  String inputVds
  String inputAnnot
  File outputVdsFileName
  File qcResultsFileName
  command { ... submit to the cluster and output to cluster staging bucket ... }
}
# workflow that uses template
workflow submit_hail_job {
  String cluster_name
  String cluster_staging_bucket
  File hailCommandFile
  String inputVds
  String inputAnnot
  File outputVdsFileName
  File qcResultsFileName

  call pyspark_dataproc_wdl.pyspark_dataproc {
    input: dataproc_cluster_specs = {"master_machine_type": "n1-standard-8", "master_machine_disk": "100"},
    inject: submit_job = submit_job(input: cluster_name=cluster_name, cluster_staging_bucket=cluster_staging_bucket,
                                    hailCommandFile=hailCommandFile, inputVds=inputVds, inputAnnot=inputAnnot,
                                    outputVdsFileName=outputVdsFileName, qcResultsFileName=qcResultsFileName)
  }

  output {
    String out1 = pyspark_dataproc.out1
  }
}
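For completeness, a Cromwell-style inputs file for submit_hail_job might look like this (all values are placeholders, not taken from the original issue):

{
  "submit_hail_job.cluster_name": "hail-test-cluster",
  "submit_hail_job.cluster_staging_bucket": "gs://hail-test-staging",
  "submit_hail_job.hailCommandFile": "gs://my-bucket/hail_script.py",
  "submit_hail_job.inputVds": "gs://my-bucket/input.vds",
  "submit_hail_job.inputAnnot": "gs://my-bucket/annotations.tsv",
  "submit_hail_job.outputVdsFileName": "output.vds",
  "submit_hail_job.qcResultsFileName": "qc_results.txt"
}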
Top GitHub Comments
I don't think we specifically landed anywhere. My impression is that this goes against the goal of abstracting WDL from the execution environment. It's possible this logic could be put into the `hints` section, but I would not be particularly keen on voting this forward myself.

Right. Closing this issue with the recommendation that if someone cares very, very strongly they can open a new proposal working out how this would work as a hints-based thing.
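If someone does pick this up, a hints-based version might look roughly like the sketch below. The hint keys here are hypothetical and engine-specific; nothing like them is defined in the WDL spec, and an engine would have to understand them to manage the cluster lifecycle around the task:

task submit_job {
  String cluster_name
  File hailCommandFile
  command { ... submit to the cluster ... }
  hints {
    # hypothetical engine-specific keys; an engine that understands them
    # could create, reuse, and tear down the cluster around this task
    dataproc_master_machine_type: "n1-standard-8"
    dataproc_master_disk_gb: 100
    dataproc_reuse_cluster: true
  }
}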