This job was passed a promise that wasn't yet resolved when it ran
Hi Toil team,
I’m getting random failures in about 20-30% of pipeline runs. When it happens, a restart consistently fails the run, but a relaunch with the same inputs usually succeeds. I know how hard it is to debug non-reproducible issues, but I’m hoping this has come up before.
I’m using Toil 3.15.0. My pipeline is written in CWL. Because of local limitations I’m running with --no-container. Jobs run on a local Slurm HPC with these arguments:
cwltoil \
--no-container \
--batchSystem slurm \
--disableCaching \
--retryCount 1 \
--defaultCores 4 \
--defaultMemory 20G \
--logFile pipeline/run.log \
--jobStore pipeline/run.jobstore \
--workDir /tmp \
--outdir output/run \
pipeline/workflow.cwl \
pipeline/run-job.json
This is the crash log:
Job ended successfully: 'file:///mnt/pipeline/tools/samtools/1.7/samtools_merge.cwl' samtools merge i/F/jobvn7O2e
The job seems to have left a log file, indicating failure: 'ResolveIndirect' 4/0/jobajDurU
4/0/jobajDurU INFO:toil.worker:---TOIL WORKER OUTPUT LOG---
4/0/jobajDurU INFO:toil:Running Toil version 3.14.0-b91dbf9bf6116879952f0a70f9a2fbbcae7e51b6.
4/0/jobajDurU Traceback (most recent call last):
4/0/jobajDurU File "/mnt/pipeline/env/lib/python2.7/site-packages/toil/worker.py", line 294, in workerScript
4/0/jobajDurU job = Job._loadJob(jobGraph.command, jobStore)
4/0/jobajDurU File "/mnt/pipeline/env/lib/python2.7/site-packages/toil/job.py", line 907, in _loadJob
4/0/jobajDurU return cls._unpickle(userModule, fileHandle, jobStore.config)
4/0/jobajDurU File "/mnt/pipeline/env/lib/python2.7/site-packages/toil/job.py", line 936, in _unpickle
4/0/jobajDurU runnable = unpickler.load()
4/0/jobajDurU File "/mnt/pipeline/env/lib/python2.7/site-packages/toil/job.py", line 1740, in __new__
4/0/jobajDurU return cls._resolve(*args)
4/0/jobajDurU File "/mnt/pipeline/env/lib/python2.7/site-packages/toil/job.py", line 1752, in _resolve
4/0/jobajDurU value = pickle.load(fileHandle)
4/0/jobajDurU File "/mnt/pipeline/env/lib/python2.7/site-packages/toil/job.py", line 1740, in __new__
4/0/jobajDurU return cls._resolve(*args)
4/0/jobajDurU File "/mnt/pipeline/env/lib/python2.7/site-packages/toil/job.py", line 1752, in _resolve
4/0/jobajDurU value = pickle.load(fileHandle)
4/0/jobajDurU File "/mnt/pipeline/env/lib/python2.7/site-packages/toil/job.py", line 1832, in __setstate__
4/0/jobajDurU "this job as a child/follow-on of {jobName}.".format(jobName=jobName))
4/0/jobajDurU RuntimeError: This job was passed a promise that wasn't yet resolved when it ran. The job 'CWLGather' that fulfills this promise hasn't yet finished. This means that there aren't enough constraints to ensure the current job always runs after 'CWLGather'. Consider adding a follow-on indirection between this job and its parent, or adding this job as a child/follow-on of 'CWLGather'.
4/0/jobajDurU ERROR:toil.worker:Exiting the worker because of a failed job on host hpcc-1
4/0/jobajDurU WARNING:toil.jobGraph:Due to failure we are reducing the remaining retry count of job 'ResolveIndirect' 4/0/jobajDurU with ID 4/0/jobajDurU to 1
<...snip...>
Job 'ResolveIndirect' 4/0/jobajDurU with ID 4/0/jobajDurU is completely failed
No jobs left to run so exiting.
Finished the main loop
Waiting for service manager thread to finish ...
... finished shutting down the service manager. Took 0.730870962143 seconds
Waiting for stats and logging collator thread to finish ...
... finished collating stats and logs. Took 0.399125814438 seconds
Finished toil run with 2 failed jobs
Failed jobs at end of the run: 'CWLScatter' M/D/jobopIemN 'ResolveIndirect' 4/0/jobajDurU
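For what it's worth, here is my reading of that RuntimeError in plain Toil Python terms. This is just a hand-written sketch to illustrate the constraint the message describes, not code from my pipeline or from Toil's CWL runner; the function names and the ./jobstore path are made up.
from toil.common import Toil
from toil.job import Job

def produce(job):
    # Fulfils the promise returned by producer.rv()
    return "some value"

def consume(job, value):
    # Needs the promise to already be resolved when it runs
    job.fileStore.logToMaster("got %s" % value)

def root(job):
    producer = job.addChildJobFn(produce)
    # Broken pattern: making the consumer a sibling child means nothing forces
    # it to run after the producer, so it can unpickle an unresolved promise
    # and fail like the log above (only sometimes, since sibling ordering is
    # not deterministic):
    #   job.addChildJobFn(consume, producer.rv())
    # Pattern the error message suggests: a follow-on indirection guarantees
    # the consumer runs after all of root's children, including the producer.
    job.addFollowOnJobFn(consume, producer.rv())

if __name__ == "__main__":
    options = Job.Runner.getDefaultOptions("./jobstore")
    with Toil(options) as toil:
        toil.start(Job.wrapJobFn(root))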
All the crashes happen at exactly the same spot. The job store is NFS-mounted on all nodes, while the work dir is a local scratch folder. The failing task is an input-merging step that runs after a sub-workflow step containing a scatter. I wonder if having the scatter inside the sub-workflow is what confuses Toil. Here are the CWL sections I think are relevant:
sample_processing.cwl
steps:
process_reads:
in:
cutadapt_threads: cutadapt_threads
i5_adapter: i5_adapter
i7_adapter: i7_adapter
index_base: reference_file
minimum_length: minimum_length
read_group:
source: sample
valueFrom: $(self.read_group)
reads:
source: sample
valueFrom: $(self.reads)
sample_name:
source: sample
valueFrom: $(self.sample_name)
out:
- trimmed_reads
- trim_reads_log
- quality_html_report
- quality_compressed_report
- alignments
run: reads_processing.cwl
merge:
in:
alignment_files: process_reads/alignments
output_filename:
source: sample
valueFrom: "$(self.sample_name).bam"
out: [output_file]
run: ../../../tools/samtools/1.7/samtools_merge.cwl
reads_processing.cwl
steps:
quality:
in:
reads: reads
scatter: reads
out: [html_report, compressed_report]
run: ../../../tools/fastqc/0.11.7/fastqc.cwl
trim:
in:
adapter: i7_adapter
cores: cutadapt_threads
front: i5_adapter
minimum_length: minimum_length
output:
source: reads
valueFrom: $(self.basename.replace('.fastq.gz', '.trimmed.fastq.gz'))
reads: reads
scatter: [output, reads]
scatterMethod: dotproduct
out: [trimmed_reads, log]
run: ../../../tools/cutadapt/v1.16/cutadapt.cwl
align:
in:
index_base: index_base
read_group: read_group
reads_1:
source: trim/trimmed_reads
output:
source: reads
valueFrom: $(self.basename.replace('.fastq.gz', '.bam'))
threads: threads
scatter: [output, reads_1]
scatterMethod: dotproduct
out: [output_file]
run: ../../../tools/bwa/0.7.17/bwa_mem.cwl
outputs:
trimmed_reads:
type: File[]
outputSource: trim/trimmed_reads
trim_reads_log:
type: File[]
outputSource: trim/log
quality_html_report:
type: File[]
outputSource: quality/html_report
quality_compressed_report:
type: File[]
outputSource: quality/compressed_report
alignments:
type: File[]
outputSource: align/output_file
Is there anything else I could provide to help solve this issue?
Thanks in advance!
Top GitHub Comments
@DailyDreaming @mr-c #2221 (not yet merged into master) fixes the graph building error above. Thanks!
I’m good, closing.