Workflow seemingly gets stuck with scattered subworkflow in 30.1

This was run on the PAPI backend, using the latest commit of the 30_hotfix branch: https://github.com/broadinstitute/cromwell/commit/80c416d42291db5d79aaa5c0e24d054769d9fe0a

I have a workflow that runs its first task and then never launches the following call to a subworkflow. Based on a little debugging, it seems that it has to do with the subworkflow call being inside a scatter.

For instance, when I run this workflow (shrunk to the important part):

  call GetBwaVersion

  # Align flowcell-level unmapped input bams in parallel
  scatter (unmapped_bam in flowcell_unmapped_bams) {

    Float unmapped_bam_size = size(unmapped_bam, "GB")

    String sub_strip_path = "gs://.*/"
    String sub_strip_unmapped = unmapped_bam_suffix + "$"
    String sub_sub = sub(sub(unmapped_bam, sub_strip_path, ""), sub_strip_unmapped, "")

    if (unmapped_bam_size > cutoff_for_large_rg_in_gb) {
      # Split bam into multiple smaller bams,
      # map reads to reference and recombine into one bam
      call splitRG.SplitLargeRG as SplitRG {
        input:
          input_bam = unmapped_bam,
          bwa_commandline = bwa_commandline,
          bwa_version = GetBwaVersion.version,
          output_bam_basename = sub_sub + ".aligned.unsorted",
          ref_fasta = ref_fasta,
          ref_fasta_index = ref_fasta_index,
          ref_dict = ref_dict,
          ref_alt = ref_alt,
          ref_amb = ref_amb,
          ref_ann = ref_ann, 
          ref_bwt = ref_bwt,
          ref_pac = ref_pac,
          ref_sa = ref_sa,
          additional_disk = additional_disk,
          compression_level = compression_level,
          preemptible_tries = preemptible_tries,
          bwa_ref_size = bwa_ref_size,
          disk_multiplier = bwa_disk_multiplier,
          unmapped_bam_size = unmapped_bam_size
      }
    }

    if (unmapped_bam_size <= cutoff_for_large_rg_in_gb) {
      # Map reads to reference
      call commonTasks.SamToFastqAndBwaMemAndMba {
        input:
          input_bam = unmapped_bam,
          bwa_commandline = bwa_commandline,
          output_bam_basename = sub_sub + ".aligned.unsorted",
          ref_fasta = ref_fasta,
          ref_fasta_index = ref_fasta_index,
          ref_dict = ref_dict,
          ref_alt = ref_alt,
          ref_bwt = ref_bwt,
          ref_amb = ref_amb,
          ref_ann = ref_ann,
          ref_pac = ref_pac,
          ref_sa = ref_sa,
          bwa_version = GetBwaVersion.version,
          # The merged bam can be bigger than only the aligned bam,
          # so account for the output size by multiplying the input size by 2.75.
          disk_size = unmapped_bam_size + bwa_ref_size + (bwa_disk_multiplier * unmapped_bam_size) + additional_disk,
          compression_level = compression_level,
          preemptible_tries = preemptible_tries
      }
    }
  }
}

GetBwaVersion will run and succeed, but no SplitRG subworkflow will launch afterwards, and the workflow will just stay stuck in “Running”. This is what the server logs show:

2018-01-17 20:38:35,611 cromwell-system-akka.dispatchers.engine-dispatcher-25 INFO  - WorkflowManagerActor Starting workflow UUID(53c058fc-65db-4347-9433-cc0753614776)
2018-01-17 20:38:35,614 cromwell-system-akka.dispatchers.engine-dispatcher-25 INFO  - WorkflowManagerActor Successfully started WorkflowActor-53c058fc-65db-4347-9433-cc0753614776
2018-01-17 20:38:35,614 cromwell-system-akka.dispatchers.engine-dispatcher-25 INFO  - Retrieved 1 workflows from the WorkflowStoreActor
2018-01-17 20:38:36,970 cromwell-system-akka.dispatchers.engine-dispatcher-5 INFO  - MaterializeWorkflowDescriptorActor [UUID(53c058fc)]: Call-to-Backend assignments: SplitLargeRG.Alignment -> JES, SplitLargeRG.SamSplitter -> JES, SomaticPairedEndSingleSampleWorkflow.GetBwaVersion -> JES, SomaticPairedEndSingleSampleWorkflow.SamToFastqAndBwaMemAndMba -> JES, SplitLargeRG.GatherBamFiless -> JES, SplitLargeRG.SumSplitAlignedSizes -> JES
2018-01-17 20:38:38,399 cromwell-system-akka.dispatchers.engine-dispatcher-25 INFO  - WorkflowExecutionActor-53c058fc-65db-4347-9433-cc0753614776 [UUID(53c058fc)]: Starting calls: SomaticPairedEndSingleSampleWorkflow.GetBwaVersion:NA:1
2018-01-17 20:38:41,234 cromwell-system-akka.dispatchers.backend-dispatcher-47 INFO  - JesAsyncBackendJobExecutionActor [UUID(53c058fc)SomaticPairedEndSingleSampleWorkflow.GetBwaVersion:NA:1]: `# not setting set -o pipefail here because /bwa has a rc=1 and we dont want to allow rc=1 to succeed because
# the sed may also fail with that error and that is something we actually want to fail on.
/usr/gitc/bwa 2>&1 | \
grep -e '^Version' | \
sed 's/Version: //'`
2018-01-17 20:38:49,163 cromwell-system-akka.dispatchers.backend-dispatcher-47 INFO  - JesAsyncBackendJobExecutionActor [UUID(53c058fc)SomaticPairedEndSingleSampleWorkflow.GetBwaVersion:NA:1]: job id: operations/ENrHrLeQLBiG_cWio8C8naYBIKngm4KOFSoPcHJvZHVjdGlvblF1ZXVl
2018-01-17 20:39:01,789 cromwell-system-akka.dispatchers.backend-dispatcher-47 INFO  - JesAsyncBackendJobExecutionActor [UUID(53c058fc)SomaticPairedEndSingleSampleWorkflow.GetBwaVersion:NA:1]: Status change from - to Running
2018-01-17 20:41:59,809 cromwell-system-akka.dispatchers.backend-dispatcher-60 INFO  - JesAsyncBackendJobExecutionActor [UUID(53c058fc)SomaticPairedEndSingleSampleWorkflow.GetBwaVersion:NA:1]: Status change from Running to Success

Alternatively, if I remove the scatter:

  call GetBwaVersion

  # Get the size of the standard reference files as well as the additional reference files needed for BWA
  Float ref_size = size(ref_fasta, "GB") + size(ref_fasta_index, "GB") + size(ref_dict, "GB")
  Float bwa_ref_size = ref_size + size(ref_alt, "GB") + size(ref_amb, "GB") + size(ref_ann, "GB") + size(ref_bwt, "GB") + size(ref_pac, "GB") + size(ref_sa, "GB")
  Float dbsnp_size = size(dbSNP_vcf, "GB")

  # Align flowcell-level unmapped input bams in parallel
  File unmapped_bam = flowcell_unmapped_bams[0]

    Float unmapped_bam_size = size(unmapped_bam, "GB")

    String sub_strip_path = "gs://.*/"
    String sub_strip_unmapped = unmapped_bam_suffix + "$"
    String sub_sub = sub(sub(unmapped_bam, sub_strip_path, ""), sub_strip_unmapped, "")

    if (unmapped_bam_size > cutoff_for_large_rg_in_gb) {
      # Split bam into multiple smaller bams,
      # map reads to reference and recombine into one bam
      call splitRG.SplitLargeRG as SplitRG {
        input:
          input_bam = unmapped_bam,
          bwa_commandline = bwa_commandline,
          bwa_version = GetBwaVersion.version,
          output_bam_basename = sub_sub + ".aligned.unsorted",
          ref_fasta = ref_fasta,
          ref_fasta_index = ref_fasta_index,
          ref_dict = ref_dict,
          ref_alt = ref_alt,
          ref_amb = ref_amb,
          ref_ann = ref_ann, 
          ref_bwt = ref_bwt,
          ref_pac = ref_pac,
          ref_sa = ref_sa,
          additional_disk = additional_disk,
          compression_level = compression_level,
          preemptible_tries = preemptible_tries,
          bwa_ref_size = bwa_ref_size,
          disk_multiplier = bwa_disk_multiplier,
          unmapped_bam_size = unmapped_bam_size
      }
    }

    if (unmapped_bam_size <= cutoff_for_large_rg_in_gb) {
      # Map reads to reference
      call commonTasks.SamToFastqAndBwaMemAndMba {
        input:
          input_bam = unmapped_bam,
          bwa_commandline = bwa_commandline,
          output_bam_basename = sub_sub + ".aligned.unsorted",
          ref_fasta = ref_fasta,
          ref_fasta_index = ref_fasta_index,
          ref_dict = ref_dict,
          ref_alt = ref_alt,
          ref_bwt = ref_bwt,
          ref_amb = ref_amb,
          ref_ann = ref_ann,
          ref_pac = ref_pac,
          ref_sa = ref_sa,
          bwa_version = GetBwaVersion.version,
          # The merged bam can be bigger than only the aligned bam,
          # so account for the output size by multiplying the input size by 2.75.
          disk_size = unmapped_bam_size + bwa_ref_size + (bwa_disk_multiplier * unmapped_bam_size) + additional_disk,
          compression_level = compression_level,
          preemptible_tries = preemptible_tries
      }
    }
}

GetBwaVersion runs and succeeds, and then the subworkflow launches. Server log for comparison:

2018-01-17 20:52:50,403 cromwell-system-akka.dispatchers.api-dispatcher-23 INFO  - Workflow e71c769c-948f-4bd7-8cbe-064a18375966 submitted.
2018-01-17 20:52:53,874 cromwell-system-akka.dispatchers.engine-dispatcher-31 INFO  - 1 new workflows fetched
2018-01-17 20:52:53,874 cromwell-system-akka.dispatchers.engine-dispatcher-5 INFO  - WorkflowManagerActor Starting workflow UUID(e71c769c-948f-4bd7-8cbe-064a18375966)
2018-01-17 20:52:53,874 cromwell-system-akka.dispatchers.engine-dispatcher-5 INFO  - WorkflowManagerActor Successfully started WorkflowActor-e71c769c-948f-4bd7-8cbe-064a18375966
2018-01-17 20:52:53,875 cromwell-system-akka.dispatchers.engine-dispatcher-5 INFO  - Retrieved 1 workflows from the WorkflowStoreActor
2018-01-17 20:52:54,947 cromwell-system-akka.dispatchers.engine-dispatcher-37 INFO  - MaterializeWorkflowDescriptorActor [UUID(e71c769c)]: Call-to-Backend assignments: SomaticPairedEndSingleSampleWorkflow.GetBwaVersion -> JES, SplitLargeRG.SumSplitAlignedSizes -> JES, SplitLargeRG.GatherBamFiless -> JES, SomaticPairedEndSingleSampleWorkflow.SamToFastqAndBwaMemAndMba -> JES, SplitLargeRG.SamSplitter -> JES, SplitLargeRG.Alignment -> JES
2018-01-17 20:52:56,323 cromwell-system-akka.dispatchers.engine-dispatcher-32 INFO  - WorkflowExecutionActor-e71c769c-948f-4bd7-8cbe-064a18375966 [UUID(e71c769c)]: Starting calls: SomaticPairedEndSingleSampleWorkflow.GetBwaVersion:NA:1
2018-01-17 20:53:02,487 cromwell-system-akka.dispatchers.backend-dispatcher-44 INFO  - JesAsyncBackendJobExecutionActor [UUID(e71c769c)SomaticPairedEndSingleSampleWorkflow.GetBwaVersion:NA:1]: `# not setting set -o pipefail here because /bwa has a rc=1 and we dont want to allow rc=1 to succeed because
# the sed may also fail with that error and that is something we actually want to fail on.
/usr/gitc/bwa 2>&1 | \
grep -e '^Version' | \
sed 's/Version: //'`
2018-01-17 20:53:04,348 cromwell-system-akka.dispatchers.backend-dispatcher-56 INFO  - JesAsyncBackendJobExecutionActor [UUID(e71c769c)SomaticPairedEndSingleSampleWorkflow.GetBwaVersion:NA:1]: job id: operations/EPXh4LeQLBjT9Z2WvIiz-QggqeCbgo4VKg9wcm9kdWN0aW9uUXVldWU
2018-01-17 20:53:15,636 cromwell-system-akka.dispatchers.backend-dispatcher-56 INFO  - JesAsyncBackendJobExecutionActor [UUID(e71c769c)SomaticPairedEndSingleSampleWorkflow.GetBwaVersion:NA:1]: Status change from - to Running
2018-01-17 20:56:25,285 cromwell-system-akka.dispatchers.backend-dispatcher-43 INFO  - JesAsyncBackendJobExecutionActor [UUID(e71c769c)SomaticPairedEndSingleSampleWorkflow.GetBwaVersion:NA:1]: Status change from Running to Success
2018-01-17 20:56:28,223 cromwell-system-akka.dispatchers.engine-dispatcher-37 INFO  - WorkflowExecutionActor-e71c769c-948f-4bd7-8cbe-064a18375966 [UUID(e71c769c)]: Starting calls: SubWorkflow-SplitRG:-1:1
2018-01-17 20:56:30,264 cromwell-system-akka.dispatchers.engine-dispatcher-37 INFO  - 384e88c5-eba8-400c-aaef-5d618ffdce88-SubWorkflowActor-SubWorkflow-SplitRG:-1:1 [UUID(384e88c5)]: Starting calls: SplitLargeRG.SamSplitter:NA:1
2018-01-17 20:56:30,293 cromwell-system-akka.dispatchers.backend-dispatcher-43 INFO  - JesAsyncBackendJobExecutionActor [UUID(384e88c5)SplitLargeRG.SamSplitter:NA:1]: `set -e
mkdir output_dir

total_reads=$(samtools view -c /cromwell_root/broad-dsp-spec-ops-cromwell-execution/CramToUnmappedBams/7db4d00c-0d04-43c5-b480-3cfe6080a3e3/call-SortSam/shard-0/0.1.unmapped.bam)

java -Dsamjdk.compression_level=2 -Xms3000m -jar /usr/gitc/picard.jar SplitSamByNumberOfReads \
  INPUT=/cromwell_root/broad-dsp-spec-ops-cromwell-execution/CramToUnmappedBams/7db4d00c-0d04-43c5-b480-3cfe6080a3e3/call-SortSam/shard-0/0.1.unmapped.bam \
  OUTPUT=output_dir \
  SPLIT_TO_N_READS=48000000 \
  TOTAL_READS_IN_INPUT=$total_reads`
2018-01-17 20:56:36,955 cromwell-system-akka.dispatchers.backend-dispatcher-43 INFO  - JesAsyncBackendJobExecutionActor [UUID(384e88c5)SplitLargeRG.SamSplitter:NA:1]: job id: operations/EOvc7beQLBiwi6fk-aX5yBEgqeCbgo4VKg9wcm9kdWN0aW9uUXVldWU
2018-01-17 20:56:48,780 cromwell-system-akka.dispatchers.backend-dispatcher-43 INFO  - JesAsyncBackendJobExecutionActor [UUID(384e88c5)SplitLargeRG.SamSplitter:NA:1]: Status change from - to Running
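
To make the difference between the two runs easier to see, here is a minimal sketch that pulls the "Starting calls:" entries out of a Cromwell server log and checks whether a subworkflow was ever started. It is not part of the original report. The function names and the regular expression are hypothetical and based only on the line format visible in the logs above; splitting several calls on commas is an assumption. The sample lines are copied from the two logs.

```python
import re

# Pattern inferred from the log lines in this issue (an assumption, not a
# documented Cromwell log format): "[UUID(<short id>)]: Starting calls: <calls>"
STARTING_CALLS = re.compile(
    r"\[UUID\((?P<wf>[0-9a-f]+)\)\]: Starting calls: (?P<calls>.+)$"
)


def started_calls(log_text):
    """Return every call reported as started, in log order."""
    calls = []
    for line in log_text.splitlines():
        match = STARTING_CALLS.search(line)
        if match:
            # Assumption: several calls on one line are comma-separated.
            calls.extend(c.strip() for c in match.group("calls").split(","))
    return calls


def has_subworkflow_start(log_text):
    """True if any started call carries the 'SubWorkflow-' prefix."""
    return any(c.startswith("SubWorkflow-") for c in started_calls(log_text))


# Excerpt from the scattered run (workflow 53c058fc), copied from the log above.
SCATTERED_LOG = (
    "2018-01-17 20:38:38,399 cromwell-system-akka.dispatchers.engine-dispatcher-25 INFO  - "
    "WorkflowExecutionActor-53c058fc-65db-4347-9433-cc0753614776 [UUID(53c058fc)]: "
    "Starting calls: SomaticPairedEndSingleSampleWorkflow.GetBwaVersion:NA:1\n"
)

# Excerpt from the non-scattered run (workflow e71c769c), copied from the log above.
SINGLE_LOG = (
    "2018-01-17 20:52:56,323 cromwell-system-akka.dispatchers.engine-dispatcher-32 INFO  - "
    "WorkflowExecutionActor-e71c769c-948f-4bd7-8cbe-064a18375966 [UUID(e71c769c)]: "
    "Starting calls: SomaticPairedEndSingleSampleWorkflow.GetBwaVersion:NA:1\n"
    "2018-01-17 20:56:28,223 cromwell-system-akka.dispatchers.engine-dispatcher-37 INFO  - "
    "WorkflowExecutionActor-e71c769c-948f-4bd7-8cbe-064a18375966 [UUID(e71c769c)]: "
    "Starting calls: SubWorkflow-SplitRG:-1:1\n"
    "2018-01-17 20:56:30,264 cromwell-system-akka.dispatchers.engine-dispatcher-37 INFO  - "
    "384e88c5-eba8-400c-aaef-5d618ffdce88-SubWorkflowActor-SubWorkflow-SplitRG:-1:1 "
    "[UUID(384e88c5)]: Starting calls: SplitLargeRG.SamSplitter:NA:1\n"
)

if __name__ == "__main__":
    for name, log in (("scattered", SCATTERED_LOG), ("single", SINGLE_LOG)):
        print(name, started_calls(log), has_subworkflow_start(log))
```

Run against the full server logs, a check like this should show only GetBwaVersion starting in the scattered run, while the non-scattered run goes on to start SubWorkflow-SplitRG and then SplitLargeRG.SamSplitter.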

Here is the scattered runnable version that gets stuck running: SomaticPairedSingleSampleWf.scattered.txt

Here is the non-scattered runnable version that works: SomaticPairedSingleSampleWf.single.txt

Here is the dependencies zip: SomaticPairedSingleSampleWfDependencies.zip

@kcibul I was asked to ping you on this issue.

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 12 (9 by maintainers)

Top GitHub Comments

jsotobroad commented, Jan 18, 2018 (1 reaction)

That worked! Thanks @Horneth!!

jsotobroad commented, Jan 18, 2018 (0 reactions)

A workaround is all we need for now. Will try it out.
