AWS Batch Job Cross-talk
My group has been running Cromwell with AWS Batch as part of our pipeline development process, and we’ve observed several cases of workflows “silently” failing: no Batch jobs fail, but the workflow log points to missing RC files. From our testing, this issue affects roughly 10% of the samples we try to process, although it appears “randomly” in that no single set of samples reproduces the error time after time. After digging through several logs, I believe I’ve traced the error to a case where a Batch job is submitted, but the service finds a previously run job that uses a completely different set of input files and runs that job instead. This incorrect job runs to completion, but its outputs are written to the location specified by that earlier job, hence the failure to read the RC file.
Below is an edited workflow log that demonstrates the failure:
[2019-05-22 18:42:19,86] [info] Running with database db.url = jdbc:hsqldb:mem:7e164ea8-21fd-4b3a-864c-f8a8ea97645f;shutdown=false;hsqldb.tx=mvcc
[2019-05-22 18:42:25,85] [info] Running migration RenameWorkflowOptionsInMetadata with a read batch size of 100000 and a write batch size of 100000
[2019-05-22 18:42:25,86] [info] [RenameWorkflowOptionsInMetadata] 100%
[2019-05-22 18:42:25,92] [info] Running with database db.url = jdbc:hsqldb:mem:d3111f9f-5515-48da-b4c2-c9014a6eb8ab;shutdown=false;hsqldb.tx=mvcc
[2019-05-22 18:42:26,15] [warn] Unrecognized configuration key(s) for AwsBatch: auth, numCreateDefinitionAttempts, numSubmitAttempts
[2019-05-22 18:42:26,41] [info] Slf4jLogger started
[2019-05-22 18:42:26,62] [info] Workflow heartbeat configuration:
{
"cromwellId" : "cromid-c5da692",
"heartbeatInterval" : "2 minutes",
"ttl" : "10 minutes",
"writeBatchSize" : 10000,
"writeThreshold" : 10000
}
[2019-05-22 18:42:26,66] [info] Metadata summary refreshing every 2 seconds.
[2019-05-22 18:42:26,69] [info] WriteMetadataActor configured to flush with batch size 200 and process rate 5 seconds.
[2019-05-22 18:42:26,69] [info] KvWriteActor configured to flush with batch size 200 and process rate 5 seconds.
[2019-05-22 18:42:26,71] [info] CallCacheWriteActor configured to flush with batch size 100 and process rate 3 seconds.
[2019-05-22 18:42:27,30] [info] JobExecutionTokenDispenser - Distribution rate: 50 per 1 seconds.
[2019-05-22 18:42:27,31] [info] SingleWorkflowRunnerActor: Version 36
[2019-05-22 18:42:27,35] [info] Unspecified type (Unspecified version) workflow 3997371c-9513-4386-a579-a72639c6e960 submitted
[2019-05-22 18:42:27,36] [info] SingleWorkflowRunnerActor: Workflow submitted 3997371c-9513-4386-a579-a72639c6e960
[2019-05-22 18:42:27,36] [info] WorkflowManagerActor Starting workflow 3997371c-9513-4386-a579-a72639c6e960
[2019-05-22 18:42:27,36] [info] WorkflowManagerActor Successfully started WorkflowActor-3997371c-9513-4386-a579-a72639c6e960
...
[2019-05-22 19:15:20,74] [info] 755021ae-948b-47f9-94a8-66b486bda47d-SubWorkflowActor-SubWorkflow-Haplotypecaller:0:1 [755021ae]: Starting Haplotypecaller.SplitFilesByChromosome
[2019-05-22 19:15:21,34] [info] AwsBatchAsyncBackendJobExecutionActor [755021aeHaplotypecaller.SplitFilesByChromosome:NA:1]: set -e
for chr in $(grep -v '@' /cromwell_root/s4-pbg-hc/References/HC_Panel_v3.intervals | cut -f1 | sort | uniq)
do
grep -v '@' /cromwell_root/s4-pbg-hc/References/HC_Panel_v3.intervals | grep -w $chr | awk '{ print $1":"$2"-"$3 }' > $chr.intervals
samtools view -@ 15 -b -h /cromwell_root/s4-pbg-hc/HC_Dev_Run_5/Pipeline/RSM278260-6_8plex/pipeline_workflow/3997371c-9513-4386-a579-a72639c6e960/call-Alignment/alignment.Alignment/6e782168-d056-4ac9-b83b-5fba843fffc1/call-baseRecalibrator/shard-0/RSM278260-6_8plex.dedup.recal.bam $chr > $chr.RSM278260-6_8plex.dedup.recal.bam
done
[2019-05-22 19:15:21,35] [info] Submitting job to AWS Batch
[2019-05-22 19:15:21,35] [info] dockerImage: 260062248592.dkr.ecr.us-east-1.amazonaws.com/s4-alignandmolvar:1.3.2
[2019-05-22 19:15:21,35] [info] jobQueueArn: arn:aws:batch:us-east-1:260062248592:job-queue/GenomicsDefaultQueue-80d8b8f0-15ed-11e9-b8b7-12ddf705bbc4
[2019-05-22 19:15:21,35] [info] taskId: Haplotypecaller.SplitFilesByChromosome-None-1
[2019-05-22 19:15:21,35] [info] hostpath root: hc.Haplotypecaller/hc.SplitFilesByChromosome/755021ae-948b-47f9-94a8-66b486bda47d/None/1
[2019-05-22 19:15:21,71] [info] AwsBatchAsyncBackendJobExecutionActor [755021aeHaplotypecaller.SplitFilesByChromosome:NA:1]: job id: 8ec19f2b-5b49-4422-9ad1-5b51e3db9414
[2019-05-22 19:15:21,77] [info] AwsBatchAsyncBackendJobExecutionActor [755021aeHaplotypecaller.SplitFilesByChromosome:NA:1]: Status change from - to Initializing
[2019-05-22 19:15:26,67] [info] AwsBatchAsyncBackendJobExecutionActor [755021aeHaplotypecaller.SplitFilesByChromosome:NA:1]: Status change from Initializing to Running
[2019-05-22 19:19:12,28] [info] AwsBatchAsyncBackendJobExecutionActor [755021aeHaplotypecaller.SplitFilesByChromosome:NA:1]: Status change from Running to Succeeded
[2019-05-22 19:19:18,44] [info] 755021ae-948b-47f9-94a8-66b486bda47d-SubWorkflowActor-SubWorkflow-Haplotypecaller:0:1 [755021ae]: Starting Haplotypecaller.HC_GVCF (23 shards)
[2019-05-22 19:19:19,34] [info] AwsBatchAsyncBackendJobExecutionActor [755021aeHaplotypecaller.HC_GVCF:1:1]: set -e
sambamba index -t 4 /cromwell_root/s4-pbg-hc/HC_Dev_Run_5/Pipeline/RSM278260-6_8plex/pipeline_workflow/3997371c-9513-4386-a579-a72639c6e960/call-Haplotypecaller/shard-0/hc.Haplotypecaller/755021ae-948b-47f9-94a8-66b486bda47d/call-SplitFilesByChromosome/glob-313957810a5e411f50b17b2a7d630ef7/chr10.RSM278260-6_8plex.dedup.recal.bam
gatk HaplotypeCaller \
--java-options -Djava.io.tmpdir='' \
-R /cromwell_root/s4-ngs-resources-sandbox/Genomic/Broad/hg19/ucsc.hg19.fasta \
--dbsnp /cromwell_root/s4-ngs-resources-sandbox/Variant/Broad/hg19/dbsnp_138.hg19.vcf.gz \
--native-pair-hmm-threads 16 \
-L /cromwell_root/s4-pbg-hc/HC_Dev_Run_5/Pipeline/RSM278260-6_8plex/pipeline_workflow/3997371c-9513-4386-a579-a72639c6e960/call-Haplotypecaller/shard-0/hc.Haplotypecaller/755021ae-948b-47f9-94a8-66b486bda47d/call-SplitFilesByChromosome/glob-6f4bc12a708659d4f5f3eecd1cdffff7/chr10.intervals \
-I /cromwell_root/s4-pbg-hc/HC_Dev_Run_5/Pipeline/RSM278260-6_8plex/pipeline_workflow/3997371c-9513-4386-a579-a72639c6e960/call-Haplotypecaller/shard-0/hc.Haplotypecaller/755021ae-948b-47f9-94a8-66b486bda47d/call-SplitFilesByChromosome/glob-313957810a5e411f50b17b2a7d630ef7/chr10.RSM278260-6_8plex.dedup.recal.bam \
-O RSM278260-6_8plex.hc.gvcf.gz \
-ERC GVCF \
\
[2019-05-22 19:19:19,34] [info] AwsBatchAsyncBackendJobExecutionActor [755021aeHaplotypecaller.HC_GVCF:6:1]: set -e
sambamba index -t 4 /cromwell_root/s4-pbg-hc/HC_Dev_Run_5/Pipeline/RSM278260-6_8plex/pipeline_workflow/3997371c-9513-4386-a579-a72639c6e960/call-Haplotypecaller/shard-0/hc.Haplotypecaller/755021ae-948b-47f9-94a8-66b486bda47d/call-SplitFilesByChromosome/glob-313957810a5e411f50b17b2a7d630ef7/chr15.RSM278260-6_8plex.dedup.recal.bam
gatk HaplotypeCaller \
--java-options -Djava.io.tmpdir='' \
-R /cromwell_root/s4-ngs-resources-sandbox/Genomic/Broad/hg19/ucsc.hg19.fasta \
--dbsnp /cromwell_root/s4-ngs-resources-sandbox/Variant/Broad/hg19/dbsnp_138.hg19.vcf.gz \
--native-pair-hmm-threads 16 \
-L /cromwell_root/s4-pbg-hc/HC_Dev_Run_5/Pipeline/RSM278260-6_8plex/pipeline_workflow/3997371c-9513-4386-a579-a72639c6e960/call-Haplotypecaller/shard-0/hc.Haplotypecaller/755021ae-948b-47f9-94a8-66b486bda47d/call-SplitFilesByChromosome/glob-6f4bc12a708659d4f5f3eecd1cdffff7/chr15.intervals \
-I /cromwell_root/s4-pbg-hc/HC_Dev_Run_5/Pipeline/RSM278260-6_8plex/pipeline_workflow/3997371c-9513-4386-a579-a72639c6e960/call-Haplotypecaller/shard-0/hc.Haplotypecaller/755021ae-948b-47f9-94a8-66b486bda47d/call-SplitFilesByChromosome/glob-313957810a5e411f50b17b2a7d630ef7/chr15.RSM278260-6_8plex.dedup.recal.bam \
-O RSM278260-6_8plex.hc.gvcf.gz \
-ERC GVCF \
\
[2019-05-22 19:19:19,34] [info] Submitting job to AWS Batch
[2019-05-22 19:19:19,34] [info] dockerImage: 260062248592.dkr.ecr.us-east-1.amazonaws.com/s4-alignandmolvar:1.3.2
[2019-05-22 19:19:19,34] [info] jobQueueArn: arn:aws:batch:us-east-1:260062248592:job-queue/GenomicsDefaultQueue-80d8b8f0-15ed-11e9-b8b7-12ddf705bbc4
[2019-05-22 19:19:19,34] [info] taskId: Haplotypecaller.HC_GVCF-Some(1)-1
[2019-05-22 19:19:19,34] [info] hostpath root: hc.Haplotypecaller/hc.HC_GVCF/755021ae-948b-47f9-94a8-66b486bda47d/Some(1)/1
...
[2019-05-22 19:19:19,34] [info] Submitting job to AWS Batch
[2019-05-22 19:19:19,34] [info] dockerImage: 260062248592.dkr.ecr.us-east-1.amazonaws.com/s4-alignandmolvar:1.3.2
[2019-05-22 19:19:19,34] [info] jobQueueArn: arn:aws:batch:us-east-1:260062248592:job-queue/GenomicsDefaultQueue-80d8b8f0-15ed-11e9-b8b7-12ddf705bbc4
[2019-05-22 19:19:19,34] [info] taskId: Haplotypecaller.HC_GVCF-Some(6)-1
[2019-05-22 19:19:19,34] [info] hostpath root: hc.Haplotypecaller/hc.HC_GVCF/755021ae-948b-47f9-94a8-66b486bda47d/Some(6)/1
...
[2019-05-22 19:19:19,51] [warn] Job definition already exists. Performing describe and retrieving latest revision.
[2019-05-22 19:19:21,71] [info] AwsBatchAsyncBackendJobExecutionActor [755021aeHaplotypecaller.HC_GVCF:1:1]: job id: 45a77017-89a7-45c0-8b8b-d40ae2420212
[2019-05-22 19:19:21,76] [info] AwsBatchAsyncBackendJobExecutionActor [755021aeHaplotypecaller.HC_GVCF:1:1]: Status change from - to Initializing
[2019-05-22 19:19:26,71] [info] AwsBatchAsyncBackendJobExecutionActor [755021aeHaplotypecaller.HC_GVCF:6:1]: job id: 7c2d29c2-f04e-4b3f-8579-915a6fbc9033
[2019-05-22 19:19:26,76] [info] AwsBatchAsyncBackendJobExecutionActor [755021aeHaplotypecaller.HC_GVCF:6:1]: Status change from - to Initializing
[2019-05-22 19:19:27,42] [info] AwsBatchAsyncBackendJobExecutionActor [755021aeHaplotypecaller.HC_GVCF:6:1]: Status change from Initializing to Running
...
[2019-05-22 19:21:09,63] [info] AwsBatchAsyncBackendJobExecutionActor [755021aeHaplotypecaller.HC_GVCF:1:1]: Status change from Initializing to Running
...
[2019-05-22 19:22:43,83] [info] AwsBatchAsyncBackendJobExecutionActor [755021aeHaplotypecaller.HC_GVCF:1:1]: Status change from Running to Succeeded
...
[2019-05-22 19:34:19,31] [info] AwsBatchAsyncBackendJobExecutionActor [755021aeHaplotypecaller.HC_GVCF:6:1]: Status change from Running to Succeeded
...
[2019-05-22 19:42:10,31] [error] WorkflowManagerActor Workflow 3997371c-9513-4386-a579-a72639c6e960 failed (during ExecutingWorkflowState):
cromwell.engine.io.IoAttempts$EnhancedCromwellIoException: [Attempted 1 time(s)] - IOException: Could not read from s3://s4-pbg-hc/HC_Dev_Run_5/Pipeline/RSM278260-6_8plex/pipeline_workflow/3997371c-9513-4386-a579-a72639c6e960/call-Haplotypecaller/shard-0/hc.Haplotypecaller/755021ae-948b-47f9-94a8-66b486bda47d/call-HC_GVCF/shard-6/HC_GVCF-6-rc.txt: s3://s3.amazonaws.com/s4-pbg-hc/HC_Dev_Run_5/Pipeline/RSM278260-6_8plex/pipeline_workflow/3997371c-9513-4386-a579-a72639c6e960/call-Haplotypecaller/shard-0/hc.Haplotypecaller/755021ae-948b-47f9-94a8-66b486bda47d/call-HC_GVCF/shard-6/HC_GVCF-6-rc.txt
Caused by: java.io.IOException: Could not read from s3://s4-pbg-hc/HC_Dev_Run_5/Pipeline/RSM278260-6_8plex/pipeline_workflow/3997371c-9513-4386-a579-a72639c6e960/call-Haplotypecaller/shard-0/hc.Haplotypecaller/755021ae-948b-47f9-94a8-66b486bda47d/call-HC_GVCF/shard-6/HC_GVCF-6-rc.txt: s3://s3.amazonaws.com/s4-pbg-hc/HC_Dev_Run_5/Pipeline/RSM278260-6_8plex/pipeline_workflow/3997371c-9513-4386-a579-a72639c6e960/call-Haplotypecaller/shard-0/hc.Haplotypecaller/755021ae-948b-47f9-94a8-66b486bda47d/call-HC_GVCF/shard-6/HC_GVCF-6-rc.txt
Caused by: java.nio.file.NoSuchFileException: s3://s3.amazonaws.com/s4-pbg-hc/HC_Dev_Run_5/Pipeline/RSM278260-6_8plex/pipeline_workflow/3997371c-9513-4386-a579-a72639c6e960/call-Haplotypecaller/shard-0/hc.Haplotypecaller/755021ae-948b-47f9-94a8-66b486bda47d/call-HC_GVCF/shard-6/HC_GVCF-6-rc.txt
...
[2019-05-22 19:42:10,31] [info] WorkflowManagerActor WorkflowActor-3997371c-9513-4386-a579-a72639c6e960 is in a terminal state: WorkflowFailedState
[2019-05-22 19:42:59,50] [info] SingleWorkflowRunnerActor workflow finished with status 'Failed'.
...
Workflow 3997371c-9513-4386-a579-a72639c6e960 transitioned to state Failed
Pulling the actual AWS Batch job parameters for the “failed” job (7c2d29c2-f04e-4b3f-8579-915a6fbc9033), I see the following:
{"jobs": [{
"status": "SUCCEEDED",
"container": {
"mountPoints": [{"sourceVolume": "local-disk", "containerPath": "/cromwell_root"}],
"taskArn": "arn:aws:ecs:us-east-1:260062248592:task/78221618-403c-4b10-b9e1-6c1534a44723",
"logStreamName": "hc_Haplotypecaller-hc_HC_GVCF/default/78221618-403c-4b10-b9e1-6c1534a44723",
"image": "260062248592.dkr.ecr.us-east-1.amazonaws.com/s4-TN-alignandmolvar:1.3.2",
"containerInstanceArn": "arn:aws:ecs:us-east-1:260062248592:container-instance/3cfe8456-fd3e-420d-91bc-aa1d8d134194",
"environment": [
{"name": "AWS_CROMWELL_LOCAL_DISK", "value": "/cromwell_root"},
{"name": "AWS_CROMWELL_CALL_ROOT",
"value": "s3://dev-nphi-cromwell-v8/cromwell-execution/TN_workflow/2b65d83d-7d30-465e-9127-95c6886056e4/call-Haplotypecaller/shard-1/hc.Haplotypecaller/cfe96bd8-ee6b-4ba5-8ed8-198e6f5f9589/call-HC_GVCF/shard-23"},
{"name": "AWS_CROMWELL_OUTPUTS",
"value": "Run02_Pair003_Lane1_Tumor.hc.gvcf.gz,s3://dev-nphi-cromwell-v8/cromwell-execution/TN_workflow/2b65d83d-7d30-465e-9127-95c6886056e4/call-Haplotypecaller/shard-1/hc.Haplotypecaller/cfe96bd8-ee6b-4ba5-8ed8-198e6f5f9589/call-HC_GVCF/shard-23/Run02_Pair003_Lane1_Tumor.hc.gvcf.gz,Run02_Pair003_Lane1_Tumor.hc.gvcf.gz,local-disk /cromwell_root;Run02_Pair003_Lane1_Tumor.hc.gvcf.gz.tbi,s3://dev-nphi-cromwell-v8/cromwell-execution/TN_workflow/2b65d83d-7d30-465e-9127-95c6886056e4/call-Haplotypecaller/shard-1/hc.Haplotypecaller/cfe96bd8-ee6b-4ba5-8ed8-198e6f5f9589/call-HC_GVCF/shard-23/Run02_Pair003_Lane1_Tumor.hc.gvcf.gz.tbi,Run02_Pair003_Lane1_Tumor.hc.gvcf.gz.tbi,local-disk /cromwell_root"},
{"name": "AWS_CROMWELL_INPUTS_GZ",
"value": "..."},
{"name": "AWS_CROMWELL_STDERR_FILE", "value": "/cromwell_root/HC_GVCF-23-stderr.log"},
{"name": "AWS_CROMWELL_STDOUT_FILE", "value": "/cromwell_root/HC_GVCF-23-stdout.log"},
{"name": "AWS_CROMWELL_PATH", "value": "hc.Haplotypecaller/hc.HC_GVCF/cfe96bd8-ee6b-4ba5-8ed8-198e6f5f9589/Some(23)/1"},
{"name": "AWS_CROMWELL_RC_FILE", "value": "/cromwell_root/HC_GVCF-23-rc.txt"},
{"name": "AWS_CROMWELL_WORKFLOW_ROOT",
"value": "s3://dev-nphi-cromwell-v8/cromwell-execution/TN_workflow/2b65d83d-7d30-465e-9127-95c6886056e4/call-Haplotypecaller/shard-1/hc.Haplotypecaller/cfe96bd8-ee6b-4ba5-8ed8-198e6f5f9589/"}
],
"vcpus": 16,
"command": [
"gzipdata", "/bin/bash", "-c",
"..."
],
"volumes": [{"host": {
"sourcePath": "/cromwell_root/hc.Haplotypecaller/hc.HC_GVCF/cfe96bd8-ee6b-4ba5-8ed8-198e6f5f9589/Some(23)/1"}, "name": "local-disk"}],
"memory": 32000, "ulimits": [], "exitCode": 0},
"parameters": {},
"jobDefinition": "arn:aws:batch:us-east-1:260062248592:job-definition/hc_Haplotypecaller-hc_HC_GVCF:19527",
"statusReason": "Essential container in task exited",
"jobId": "7c2d29c2-f04e-4b3f-8579-915a6fbc9033",
"attempts": [{
"startedAt": 1558552881926, "container": {
"taskArn": "arn:aws:ecs:us-east-1:260062248592:task/78221618-403c-4b10-b9e1-6c1534a44723",
"containerInstanceArn": "arn:aws:ecs:us-east-1:260062248592:container-instance/3cfe8456-fd3e-420d-91bc-aa1d8d134194",
"logStreamName": "hc_Haplotypecaller-hc_HC_GVCF/default/78221618-403c-4b10-b9e1-6c1534a44723",
"exitCode": 0}, "stoppedAt": 1558553539743, "statusReason": "Essential container in task exited"}],
"jobQueue": "arn:aws:batch:us-east-1:260062248592:job-queue/GenomicsDefaultQueue-80d8b8f0-15ed-11e9-b8b7-12ddf705bbc4",
"dependsOn": [],
"startedAt": 1558552881926,
"jobName": "Haplotypecaller_HC_GVCF",
"createdAt": 1558552763368, "stoppedAt": 1558553539743}]}
Clearly, the AWS Batch job parameters reference a completely different set of input files from the set described in the workflow log. In this particular case, the job described in the log was started via cromwell run using v36 on an isolated EC2 instance, while the workflow described by the job parameters JSON was submitted to a Cromwell v36.1 server running on a completely separate EC2 instance. This would point to call caching NOT being the problem, but rather a more fundamental issue with how Cromwell interfaces with the AWS Batch backend to submit jobs.
We’ve also observed this behavior with Cromwell v40 and v41 (the latter on a completely new stack created just for that version), in both run and server modes.
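In the meantime, the cross-talk can at least be detected after the fact by comparing the AWS_CROMWELL_* environment on the Batch job that Cromwell reports for a call against the workflow ID in the Cromwell log. Below is a rough sketch of such a check using the AWS CLI and jq (the IDs are the ones from this run; the assumption, based on the job description above, is that every Cromwell-submitted job carries AWS_CROMWELL_WORKFLOW_ROOT):

#!/bin/bash
# Check whether a Batch job's recorded workflow root matches the workflow Cromwell thinks it submitted.
JOB_ID="7c2d29c2-f04e-4b3f-8579-915a6fbc9033"
WORKFLOW_ID="3997371c-9513-4386-a579-a72639c6e960"

ROOT=$(aws batch describe-jobs --jobs "$JOB_ID" |
  jq -r '.jobs[0].container.environment[] | select(.name=="AWS_CROMWELL_WORKFLOW_ROOT") | .value')

if [[ "$ROOT" == *"$WORKFLOW_ID"* ]]; then
  echo "OK: job $JOB_ID belongs to workflow $WORKFLOW_ID"
else
  echo "MISMATCH: job $JOB_ID wrote its outputs under $ROOT"
fi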
If more information is needed, please reach out and we’ll provide what we can; the transient nature of the Batch job parameters and the lack of a set of cases that reliably reproduces this error have made it difficult for us to investigate, and we’re hoping developer assistance can get this resolved.
Top GitHub Comments
I’m wondering if this is related to Cromwell creating new job definitions for every new call, versus using parameter substitution to modify the inputs for a single job definition? There may be some sort of backend issue in the integration with the AWS APIs where an old job definition is being used instead of yet another new definition being created with the correct inputs?
This would track with the workflow log reporting that the job definition already exists and then reusing a job that has inputs for a completely different sample.
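For illustration, the “single definition plus parameter substitution” approach would look roughly like this with the AWS CLI (the definition name and the placeholder are made up; this is not what Cromwell currently does):

# Register one generic definition whose command contains a Ref:: placeholder.
aws batch register-job-definition \
  --job-definition-name cromwell-generic \
  --type container \
  --parameters '{"script": "true"}' \
  --container-properties '{
    "image": "260062248592.dkr.ecr.us-east-1.amazonaws.com/s4-alignandmolvar:1.3.2",
    "vcpus": 16,
    "memory": 32000,
    "command": ["/bin/bash", "-c", "Ref::script"]
  }'

# Every call then reuses that same definition, substituting its own command at submit time.
aws batch submit-job \
  --job-name Haplotypecaller_HC_GVCF_shard6 \
  --job-queue GenomicsDefaultQueue-80d8b8f0-15ed-11e9-b8b7-12ddf705bbc4 \
  --job-definition cromwell-generic \
  --parameters '{"script": "sambamba index ... && gatk HaplotypeCaller ..."}'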
Deregistering a job definition is probably not recommended, since it would likely have constraints similar to the current method of creating a job definition for each task call. Reducing the number of job definitions that need to be created would be the better long-term fix. That would require changing how task-specific paths on the host instance are created, e.g. by using a combination of job environment variables and additional commands passed via container overrides.
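For example, with a single shared definition, each call could supply its task-specific command and paths at submit time via container overrides, along the lines of the sketch below (purely illustrative; “cromwell-generic” is a hypothetical shared definition, and the environment variable names are the ones Cromwell already sets, per the job description above):

# Submit one call against a shared definition; the per-call command and Cromwell paths
# are passed as container overrides rather than baked into a new job definition.
aws batch submit-job \
  --job-name Haplotypecaller_HC_GVCF_shard6 \
  --job-queue GenomicsDefaultQueue-80d8b8f0-15ed-11e9-b8b7-12ddf705bbc4 \
  --job-definition cromwell-generic \
  --container-overrides '{
    "command": ["/bin/bash", "-c", "sambamba index ... && gatk HaplotypeCaller ..."],
    "environment": [
      {"name": "AWS_CROMWELL_CALL_ROOT",
       "value": "s3://s4-pbg-hc/HC_Dev_Run_5/.../call-HC_GVCF/shard-6"},
      {"name": "AWS_CROMWELL_RC_FILE", "value": "/cromwell_root/HC_GVCF-6-rc.txt"}
    ]
  }'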