Workflow seemingly gets stuck with scattered subworkflow in 30.1
This was run on the latest commit of the 30_hotfix branch (https://github.com/broadinstitute/cromwell/commit/80c416d42291db5d79aaa5c0e24d054769d9fe0a) on the PAPI backend.
I have a workflow that runs its first task and then never launches the following call to a subworkflow. Based on a little debugging, it seems that it has to do with the subworkflow call being inside of a scatter.
For instance, when I run this workflow (shrunk to the important part):
  call GetBwaVersion
  # Align flowcell-level unmapped input bams in parallel
  scatter (unmapped_bam in flowcell_unmapped_bams) {
    Float unmapped_bam_size = size(unmapped_bam, "GB")
    String sub_strip_path = "gs://.*/"
    String sub_strip_unmapped = unmapped_bam_suffix + "$"
    String sub_sub = sub(sub(unmapped_bam, sub_strip_path, ""), sub_strip_unmapped, "")
    if (unmapped_bam_size > cutoff_for_large_rg_in_gb) {
      # Split bam into multiple smaller bams,
      # map reads to reference and recombine into one bam
      call splitRG.SplitLargeRG as SplitRG {
        input:
          input_bam = unmapped_bam,
          bwa_commandline = bwa_commandline,
          bwa_version = GetBwaVersion.version,
          output_bam_basename = sub_sub + ".aligned.unsorted",
          ref_fasta = ref_fasta,
          ref_fasta_index = ref_fasta_index,
          ref_dict = ref_dict,
          ref_alt = ref_alt,
          ref_amb = ref_amb,
          ref_ann = ref_ann,
          ref_bwt = ref_bwt,
          ref_pac = ref_pac,
          ref_sa = ref_sa,
          additional_disk = additional_disk,
          compression_level = compression_level,
          preemptible_tries = preemptible_tries,
          bwa_ref_size = bwa_ref_size,
          disk_multiplier = bwa_disk_multiplier,
          unmapped_bam_size = unmapped_bam_size
      }
    }
    if (unmapped_bam_size <= cutoff_for_large_rg_in_gb) {
      # Map reads to reference
      call commonTasks.SamToFastqAndBwaMemAndMba {
        input:
          input_bam = unmapped_bam,
          bwa_commandline = bwa_commandline,
          output_bam_basename = sub_sub + ".aligned.unsorted",
          ref_fasta = ref_fasta,
          ref_fasta_index = ref_fasta_index,
          ref_dict = ref_dict,
          ref_alt = ref_alt,
          ref_bwt = ref_bwt,
          ref_amb = ref_amb,
          ref_ann = ref_ann,
          ref_pac = ref_pac,
          ref_sa = ref_sa,
          bwa_version = GetBwaVersion.version,
          # The merged bam can be bigger than only the aligned bam,
          # so account for the output size by multiplying the input size by 2.75.
          disk_size = unmapped_bam_size + bwa_ref_size + (bwa_disk_multiplier * unmapped_bam_size) + additional_disk,
          compression_level = compression_level,
          preemptible_tries = preemptible_tries
      }
    }
  }
}
GetBwaVersion will run and succeed, but no SplitRG workflow launches afterwards and the workflow just stays stuck in "Running". This is what the server logs show:
2018-01-17 20:38:35,611 cromwell-system-akka.dispatchers.engine-dispatcher-25 INFO - WorkflowManagerActor Starting workflow UUID(53c058fc-65db-4347-9433-cc0753614776)
2018-01-17 20:38:35,614 cromwell-system-akka.dispatchers.engine-dispatcher-25 INFO - WorkflowManagerActor Successfully started WorkflowActor-53c058fc-65db-4347-9433-cc0753614776
2018-01-17 20:38:35,614 cromwell-system-akka.dispatchers.engine-dispatcher-25 INFO - Retrieved 1 workflows from the WorkflowStoreActor
2018-01-17 20:38:36,970 cromwell-system-akka.dispatchers.engine-dispatcher-5 INFO - MaterializeWorkflowDescriptorActor [UUID(53c058fc)]: Call-to-Backend assignments: SplitLargeRG.Alignment -> JES, SplitLargeRG.SamSplitter -> JES, SomaticPairedEndSingleSampleWorkflow.GetBwaVersion -> JES, SomaticPairedEndSingleSampleWorkflow.SamToFastqAndBwaMemAndMba -> JES, SplitLargeRG.GatherBamFiless -> JES, SplitLargeRG.SumSplitAlignedSizes -> JES
2018-01-17 20:38:38,399 cromwell-system-akka.dispatchers.engine-dispatcher-25 INFO - WorkflowExecutionActor-53c058fc-65db-4347-9433-cc0753614776 [UUID(53c058fc)]: Starting calls: SomaticPairedEndSingleSampleWorkflow.GetBwaVersion:NA:1
2018-01-17 20:38:41,234 cromwell-system-akka.dispatchers.backend-dispatcher-47 INFO - JesAsyncBackendJobExecutionActor [UUID(53c058fc)SomaticPairedEndSingleSampleWorkflow.GetBwaVersion:NA:1]: `# not setting set -o pipefail here because /bwa has a rc=1 and we dont want to allow rc=1 to succeed because
# the sed may also fail with that error and that is something we actually want to fail on.
/usr/gitc/bwa 2>&1 | \
grep -e '^Version' | \
sed 's/Version: //'`
2018-01-17 20:38:49,163 cromwell-system-akka.dispatchers.backend-dispatcher-47 INFO - JesAsyncBackendJobExecutionActor [UUID(53c058fc)SomaticPairedEndSingleSampleWorkflow.GetBwaVersion:NA:1]: job id: operations/ENrHrLeQLBiG_cWio8C8naYBIKngm4KOFSoPcHJvZHVjdGlvblF1ZXVl
2018-01-17 20:39:01,789 cromwell-system-akka.dispatchers.backend-dispatcher-47 INFO - JesAsyncBackendJobExecutionActor [UUID(53c058fc)SomaticPairedEndSingleSampleWorkflow.GetBwaVersion:NA:1]: Status change from - to Running
2018-01-17 20:41:59,809 cromwell-system-akka.dispatchers.backend-dispatcher-60 INFO - JesAsyncBackendJobExecutionActor [UUID(53c058fc)SomaticPairedEndSingleSampleWorkflow.GetBwaVersion:NA:1]: Status change from Running to Success
Alternatively, if I remove the scatter:
  call GetBwaVersion
  # Get the size of the standard reference files as well as the additional reference files needed for BWA
  Float ref_size = size(ref_fasta, "GB") + size(ref_fasta_index, "GB") + size(ref_dict, "GB")
  Float bwa_ref_size = ref_size + size(ref_alt, "GB") + size(ref_amb, "GB") + size(ref_ann, "GB") + size(ref_bwt, "GB") + size(ref_pac, "GB") + size(ref_sa, "GB")
  Float dbsnp_size = size(dbSNP_vcf, "GB")
  # Align flowcell-level unmapped input bams in parallel
  File unmapped_bam = flowcell_unmapped_bams[0]
  Float unmapped_bam_size = size(unmapped_bam, "GB")
  String sub_strip_path = "gs://.*/"
  String sub_strip_unmapped = unmapped_bam_suffix + "$"
  String sub_sub = sub(sub(unmapped_bam, sub_strip_path, ""), sub_strip_unmapped, "")
  if (unmapped_bam_size > cutoff_for_large_rg_in_gb) {
    # Split bam into multiple smaller bams,
    # map reads to reference and recombine into one bam
    call splitRG.SplitLargeRG as SplitRG {
      input:
        input_bam = unmapped_bam,
        bwa_commandline = bwa_commandline,
        bwa_version = GetBwaVersion.version,
        output_bam_basename = sub_sub + ".aligned.unsorted",
        ref_fasta = ref_fasta,
        ref_fasta_index = ref_fasta_index,
        ref_dict = ref_dict,
        ref_alt = ref_alt,
        ref_amb = ref_amb,
        ref_ann = ref_ann,
        ref_bwt = ref_bwt,
        ref_pac = ref_pac,
        ref_sa = ref_sa,
        additional_disk = additional_disk,
        compression_level = compression_level,
        preemptible_tries = preemptible_tries,
        bwa_ref_size = bwa_ref_size,
        disk_multiplier = bwa_disk_multiplier,
        unmapped_bam_size = unmapped_bam_size
    }
  }
  if (unmapped_bam_size <= cutoff_for_large_rg_in_gb) {
    # Map reads to reference
    call commonTasks.SamToFastqAndBwaMemAndMba {
      input:
        input_bam = unmapped_bam,
        bwa_commandline = bwa_commandline,
        output_bam_basename = sub_sub + ".aligned.unsorted",
        ref_fasta = ref_fasta,
        ref_fasta_index = ref_fasta_index,
        ref_dict = ref_dict,
        ref_alt = ref_alt,
        ref_bwt = ref_bwt,
        ref_amb = ref_amb,
        ref_ann = ref_ann,
        ref_pac = ref_pac,
        ref_sa = ref_sa,
        bwa_version = GetBwaVersion.version,
        # The merged bam can be bigger than only the aligned bam,
        # so account for the output size by multiplying the input size by 2.75.
        disk_size = unmapped_bam_size + bwa_ref_size + (bwa_disk_multiplier * unmapped_bam_size) + additional_disk,
        compression_level = compression_level,
        preemptible_tries = preemptible_tries
    }
  }
}
GetBwaVersion runs and succeeds and then launches the subworkflow. Server log for comparison:
2018-01-17 20:52:50,403 cromwell-system-akka.dispatchers.api-dispatcher-23 INFO - Workflow e71c769c-948f-4bd7-8cbe-064a18375966 submitted.
2018-01-17 20:52:53,874 cromwell-system-akka.dispatchers.engine-dispatcher-31 INFO - 1 new workflows fetched
2018-01-17 20:52:53,874 cromwell-system-akka.dispatchers.engine-dispatcher-5 INFO - WorkflowManagerActor Starting workflow UUID(e71c769c-948f-4bd7-8cbe-064a18375966)
2018-01-17 20:52:53,874 cromwell-system-akka.dispatchers.engine-dispatcher-5 INFO - WorkflowManagerActor Successfully started WorkflowActor-e71c769c-948f-4bd7-8cbe-064a18375966
2018-01-17 20:52:53,875 cromwell-system-akka.dispatchers.engine-dispatcher-5 INFO - Retrieved 1 workflows from the WorkflowStoreActor
2018-01-17 20:52:54,947 cromwell-system-akka.dispatchers.engine-dispatcher-37 INFO - MaterializeWorkflowDescriptorActor [UUID(e71c769c)]: Call-to-Backend assignments: SomaticPairedEndSingleSampleWorkflow.GetBwaVersion -> JES, SplitLargeRG.SumSplitAlignedSizes -> JES, SplitLargeRG.GatherBamFiless -> JES, SomaticPairedEndSingleSampleWorkflow.SamToFastqAndBwaMemAndMba -> JES, SplitLargeRG.SamSplitter -> JES, SplitLargeRG.Alignment -> JES
2018-01-17 20:52:56,323 cromwell-system-akka.dispatchers.engine-dispatcher-32 INFO - WorkflowExecutionActor-e71c769c-948f-4bd7-8cbe-064a18375966 [UUID(e71c769c)]: Starting calls: SomaticPairedEndSingleSampleWorkflow.GetBwaVersion:NA:1
2018-01-17 20:53:02,487 cromwell-system-akka.dispatchers.backend-dispatcher-44 INFO - JesAsyncBackendJobExecutionActor [UUID(e71c769c)SomaticPairedEndSingleSampleWorkflow.GetBwaVersion:NA:1]: `# not setting set -o pipefail here because /bwa has a rc=1 and we dont want to allow rc=1 to succeed because
# the sed may also fail with that error and that is something we actually want to fail on.
/usr/gitc/bwa 2>&1 | \
grep -e '^Version' | \
sed 's/Version: //'`
2018-01-17 20:53:04,348 cromwell-system-akka.dispatchers.backend-dispatcher-56 INFO - JesAsyncBackendJobExecutionActor [UUID(e71c769c)SomaticPairedEndSingleSampleWorkflow.GetBwaVersion:NA:1]: job id: operations/EPXh4LeQLBjT9Z2WvIiz-QggqeCbgo4VKg9wcm9kdWN0aW9uUXVldWU
2018-01-17 20:53:15,636 cromwell-system-akka.dispatchers.backend-dispatcher-56 INFO - JesAsyncBackendJobExecutionActor [UUID(e71c769c)SomaticPairedEndSingleSampleWorkflow.GetBwaVersion:NA:1]: Status change from - to Running
2018-01-17 20:56:25,285 cromwell-system-akka.dispatchers.backend-dispatcher-43 INFO - JesAsyncBackendJobExecutionActor [UUID(e71c769c)SomaticPairedEndSingleSampleWorkflow.GetBwaVersion:NA:1]: Status change from Running to Success
2018-01-17 20:56:28,223 cromwell-system-akka.dispatchers.engine-dispatcher-37 INFO - WorkflowExecutionActor-e71c769c-948f-4bd7-8cbe-064a18375966 [UUID(e71c769c)]: Starting calls: SubWorkflow-SplitRG:-1:1
2018-01-17 20:56:30,264 cromwell-system-akka.dispatchers.engine-dispatcher-37 INFO - 384e88c5-eba8-400c-aaef-5d618ffdce88-SubWorkflowActor-SubWorkflow-SplitRG:-1:1 [UUID(384e88c5)]: Starting calls: SplitLargeRG.SamSplitter:NA:1
2018-01-17 20:56:30,293 cromwell-system-akka.dispatchers.backend-dispatcher-43 INFO - JesAsyncBackendJobExecutionActor [UUID(384e88c5)SplitLargeRG.SamSplitter:NA:1]: `set -e
mkdir output_dir
total_reads=$(samtools view -c /cromwell_root/broad-dsp-spec-ops-cromwell-execution/CramToUnmappedBams/7db4d00c-0d04-43c5-b480-3cfe6080a3e3/call-SortSam/shard-0/0.1.unmapped.bam)
java -Dsamjdk.compression_level=2 -Xms3000m -jar /usr/gitc/picard.jar SplitSamByNumberOfReads \
INPUT=/cromwell_root/broad-dsp-spec-ops-cromwell-execution/CramToUnmappedBams/7db4d00c-0d04-43c5-b480-3cfe6080a3e3/call-SortSam/shard-0/0.1.unmapped.bam \
OUTPUT=output_dir \
SPLIT_TO_N_READS=48000000 \
TOTAL_READS_IN_INPUT=$total_reads`
2018-01-17 20:56:36,955 cromwell-system-akka.dispatchers.backend-dispatcher-43 INFO - JesAsyncBackendJobExecutionActor [UUID(384e88c5)SplitLargeRG.SamSplitter:NA:1]: job id: operations/EOvc7beQLBiwi6fk-aX5yBEgqeCbgo4VKg9wcm9kdWN0aW9uUXVldWU
2018-01-17 20:56:48,780 cromwell-system-akka.dispatchers.backend-dispatcher-43 INFO - JesAsyncBackendJobExecutionActor [UUID(384e88c5)SplitLargeRG.SamSplitter:NA:1]: Status change from - to Running
Here is the scattered runnable version that gets stuck running (SomaticPairedSingleSampleWf.scattered.txt)
and the non-scattered runnable version that works great (SomaticPairedSingleSampleWf.single.txt).
Here is the dependencies zip: SomaticPairedSingleSampleWfDependencies.zip
@kcibul I was asked to ping you on this issue.
Top GitHub Comments
that worked! Thanks @Horneth !!
workaround is all we need for now. will try it out