This job was passed a promise that wasn't yet resolved when it ran
Hi Toil team,
I’m getting random failures in about 20-30% of pipeline runs. When it happens, a restart consistently fails the run, but a relaunch with the same inputs usually succeeds. I know how hard it is to debug non-reproducible issues, but I’m hoping this has come up before.
I’m using Toil 3.15.0. My pipeline is written in CWL. Because of local limitations I’m running with --no-container. Jobs run on a local Slurm HPC with these arguments:
cwltoil \
--no-container \
--batchSystem slurm \
--disableCaching \
--retryCount 1 \
--defaultCores 4 \
--defaultMemory 20G \
--logFile pipeline/run.log \
--jobStore pipeline/run.jobstore \
--workDir /tmp \
--outdir output/run \
pipeline/workflow.cwl \
pipeline/run-job.json
This is the crash log:
Job ended successfully: 'file:///mnt/pipeline/tools/samtools/1.7/samtools_merge.cwl' samtools merge i/F/jobvn7O2e
The job seems to have left a log file, indicating failure: 'ResolveIndirect' 4/0/jobajDurU
4/0/jobajDurU INFO:toil.worker:---TOIL WORKER OUTPUT LOG---
4/0/jobajDurU INFO:toil:Running Toil version 3.14.0-b91dbf9bf6116879952f0a70f9a2fbbcae7e51b6.
4/0/jobajDurU Traceback (most recent call last):
4/0/jobajDurU File "/mnt/pipeline/env/lib/python2.7/site-packages/toil/worker.py", line 294, in workerScript
4/0/jobajDurU job = Job._loadJob(jobGraph.command, jobStore)
4/0/jobajDurU File "/mnt/pipeline/env/lib/python2.7/site-packages/toil/job.py", line 907, in _loadJob
4/0/jobajDurU return cls._unpickle(userModule, fileHandle, jobStore.config)
4/0/jobajDurU File "/mnt/pipeline/env/lib/python2.7/site-packages/toil/job.py", line 936, in _unpickle
4/0/jobajDurU runnable = unpickler.load()
4/0/jobajDurU File "/mnt/pipeline/env/lib/python2.7/site-packages/toil/job.py", line 1740, in __new__
4/0/jobajDurU return cls._resolve(*args)
4/0/jobajDurU File "/mnt/pipeline/env/lib/python2.7/site-packages/toil/job.py", line 1752, in _resolve
4/0/jobajDurU value = pickle.load(fileHandle)
4/0/jobajDurU File "/mnt/pipeline/env/lib/python2.7/site-packages/toil/job.py", line 1740, in __new__
4/0/jobajDurU return cls._resolve(*args)
4/0/jobajDurU File "/mnt/pipeline/env/lib/python2.7/site-packages/toil/job.py", line 1752, in _resolve
4/0/jobajDurU value = pickle.load(fileHandle)
4/0/jobajDurU File "/mnt/pipeline/env/lib/python2.7/site-packages/toil/job.py", line 1832, in __setstate__
4/0/jobajDurU "this job as a child/follow-on of {jobName}.".format(jobName=jobName))
4/0/jobajDurU RuntimeError: This job was passed a promise that wasn't yet resolved when it ran. The job 'CWLGather' that fulfills this promise hasn't yet finished. This means that there aren't enough constraints to ensure the current job always runs after 'CWLGather'. Consider adding a follow-on indirection between this job and its parent, or adding this job as a child/follow-on of 'CWLGather'.
4/0/jobajDurU ERROR:toil.worker:Exiting the worker because of a failed job on host hpcc-1
4/0/jobajDurU WARNING:toil.jobGraph:Due to failure we are reducing the remaining retry count of job 'ResolveIndirect' 4/0/jobajDurU with ID 4/0/jobajDurU to 1
<...snip...>
Job 'ResolveIndirect' 4/0/jobajDurU with ID 4/0/jobajDurU is completely failed
No jobs left to run so exiting.
Finished the main loop
Waiting for service manager thread to finish ...
... finished shutting down the service manager. Took 0.730870962143 seconds
Waiting for stats and logging collator thread to finish ...
... finished collating stats and logs. Took 0.399125814438 seconds
Finished toil run with 2 failed jobs
Failed jobs at end of the run: 'CWLScatter' M/D/jobopIemN 'ResolveIndirect' 4/0/jobajDurU
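For what it's worth, here is my reading of that RuntimeError in plain Toil Python terms. This is just a hand-written sketch to illustrate the constraint the message describes, not code from my pipeline or from Toil's CWL runner; the function names and the ./jobstore path are made up.
from toil.common import Toil
from toil.job import Job

def produce(job):
    # Fulfils the promise returned by producer.rv()
    return "some value"

def consume(job, value):
    # Needs the promise to already be resolved when it runs
    job.fileStore.logToMaster("got %s" % value)

def root(job):
    producer = job.addChildJobFn(produce)
    # Broken pattern: making the consumer a sibling child means nothing forces
    # it to run after the producer, so it can unpickle an unresolved promise
    # and fail like the log above (only sometimes, since sibling ordering is
    # not deterministic):
    #   job.addChildJobFn(consume, producer.rv())
    # Pattern the error message suggests: a follow-on indirection guarantees
    # the consumer runs after all of root's children, including the producer.
    job.addFollowOnJobFn(consume, producer.rv())

if __name__ == "__main__":
    options = Job.Runner.getDefaultOptions("./jobstore")
    with Toil(options) as toil:
        toil.start(Job.wrapJobFn(root))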
All the crashes happen at exactly the same spot. The job store is NFS-mounted on all nodes, while the work dir is a local scratch folder. The failing task is an input-merging step that runs after a sub-workflow step containing a scatter. I wonder if having the scatter inside the sub-workflow is what confuses Toil. Here are the CWL sections I think are relevant:
sample_processing.cwl
steps:
process_reads:
in:
cutadapt_threads: cutadapt_threads
i5_adapter: i5_adapter
i7_adapter: i7_adapter
index_base: reference_file
minimum_length: minimum_length
read_group:
source: sample
valueFrom: $(self.read_group)
reads:
source: sample
valueFrom: $(self.reads)
sample_name:
source: sample
valueFrom: $(self.sample_name)
out:
- trimmed_reads
- trim_reads_log
- quality_html_report
- quality_compressed_report
- alignments
run: reads_processing.cwl
merge:
in:
alignment_files: process_reads/alignments
output_filename:
source: sample
valueFrom: "$(self.sample_name).bam"
out: [output_file]
run: ../../../tools/samtools/1.7/samtools_merge.cwl
reads_processing.cwl
steps:
quality:
in:
reads: reads
scatter: reads
out: [html_report, compressed_report]
run: ../../../tools/fastqc/0.11.7/fastqc.cwl
trim:
in:
adapter: i7_adapter
cores: cutadapt_threads
front: i5_adapter
minimum_length: minimum_length
output:
source: reads
valueFrom: $(self.basename.replace('.fastq.gz', '.trimmed.fastq.gz'))
reads: reads
scatter: [output, reads]
scatterMethod: dotproduct
out: [trimmed_reads, log]
run: ../../../tools/cutadapt/v1.16/cutadapt.cwl
align:
in:
index_base: index_base
read_group: read_group
reads_1:
source: trim/trimmed_reads
output:
source: reads
valueFrom: $(self.basename.replace('.fastq.gz', '.bam'))
threads: threads
scatter: [output, reads_1]
scatterMethod: dotproduct
out: [output_file]
run: ../../../tools/bwa/0.7.17/bwa_mem.cwl
outputs:
trimmed_reads:
type: File[]
outputSource: trim/trimmed_reads
trim_reads_log:
type: File[]
outputSource: trim/log
quality_html_report:
type: File[]
outputSource: quality/html_report
quality_compressed_report:
type: File[]
outputSource: quality/compressed_report
alignments:
type: File[]
outputSource: align/output_file
Is there anything else I could provide to help solve this issue?
Thanks in advance!
Top GitHub Comments
@DailyDreaming @mr-c #2221 (not yet merged into master) fixes the graph building error above. Thanks!
I’m good, closing.