
toil-cwl-runner on slurm fails to run jobs on multiple nodes


I have been trying to run a CWL workflow with Toil and Slurm. I do so by running an sbatch script containing toil-cwl-runner. When the job runs entirely on one Slurm node, it succeeds. If Slurm chooses to run _toil_worker on a different node than the one running toil.leader, the _toil_worker job fails and the following is included in the toil-cwl-runner output:

MainThread WARNING toil.leader: No log file is present, despite job failing...

If you look at the failed job with sacct, the State is FAILED and the ExitCode is 1:0. I have not been able to find any output logs on the nodes attempting to run _toil_worker. I have been using Toil from the master branch.
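For reference, this is one way to inspect the failed worker job; <jobid> is a placeholder for whatever job ID sacct reports for the _toil_worker step:

sacct -j <jobid> --format=JobID,JobName,State,ExitCode,NodeList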

NOTE: If you specify --workDir with a path shared across all nodes, this problem does not occur.
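As a concrete sketch of that workaround, assuming /shared/scratch is a directory mounted on every node (the path is illustrative):

toil-cwl-runner \
  --jobStore ${HOME}/jobstore \
  --batchSystem slurm \
  --workDir /shared/scratch \
  --outdir results \
  tar.cwl tar-job.yml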


I added some logging and found that the sbatch command issued by toil.leader is attempting to write stderr and stdout to files within a directory in /tmp that does not exist on the node trying to run _toil_worker:

sbatch
...
  -o /tmp/toil-882d58aa-9bc5-4faa-abf5-4906fc65e8af-8c697123-fa43-4d9b-9948-405a15a8957d/toil_job_0_batch_slurm_%j_std_output.log
  -e /tmp/toil-882d58aa-9bc5-4faa-abf5-4906fc65e8af-8c697123-fa43-4d9b-9948-405a15a8957d/toil_job_0_batch_slurm_%j_std_error.log
  --wrap=_toil_worker...

This /tmp/toil-<uuid>/ directory does get created on the toil.leader node.
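A quick way to confirm the mismatch (using the node names from the reproduction steps below) is to list the directory on both nodes:

ls -d /tmp/toil-*                    # on c1-leader: the directory is listed
ssh c1-worker 'ls -d /tmp/toil-*'    # on c1-worker: No such file or directory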

I think the following code specifies the file paths for the -o and -e arguments: https://github.com/DataBiosphere/toil/blob/81a543d0e1a8e6c299f22bb8e862d34097e1f0bc/src/toil/batchSystems/slurm.py#L190-L191

See the sbatch documentation for the -o/--output and -e/--error options.
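For context, sbatch writes stdout and stderr to the given paths on the node where the job runs, expanding %j to the job ID, and the parent directory must already exist on that node. A minimal example, which should fail in much the same way if /tmp/mydir is missing on the allocated node:

sbatch -o /tmp/mydir/out_%j.log -e /tmp/mydir/err_%j.log --wrap='hostname'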

Reproducing

This requires a Slurm cluster and Toil installed from the master branch (to pick up another bug fix). Identify two nodes you wish to run toil.leader and _toil_worker on; I will refer to them as c1-leader and c1-worker from now on.

Create an sbatch script toil.sbatch with the following contents:

#!/usr/bin/env bash

source env/bin/activate
export TOIL_SLURM_ARGS="-w c1-worker"

time toil-cwl-runner \
  --jobStore ${HOME}/jobstore \
  --batchSystem slurm \
  --outdir results \
  tar.cwl tar-job.yml 

Follow the steps from https://www.commonwl.org/user_guide/04-output/index.html but instead of running

cwl-runner tar.cwl tar-job.yml

run

sbatch -w c1-leader toil.sbatch

This should run toil-cwl-runner on the c1-leader node and _toil_worker on the c1-worker node. You should see the job fail with No log file is present, despite job failing. in toil-cwl-runner’s output.
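To double-check which node each job actually ran on, sacct's NodeList field is handy (the JobName width modifier is just for readability):

sacct --format=JobID,JobName%30,NodeList,State,ExitCode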

Issue is synchronized with Jira task TOIL-452.

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 11 (10 by maintainers)

Top GitHub Comments

1 reaction
johnbradley commented, Apr 16, 2020

@adamnovak @DailyDreaming I retested the problem and it worked on toil 4.0. As far as I am concerned this problem is resolved.
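For anyone retesting, a plain pip install along these lines should pull the released 4.0 series instead of master; the exact version pin is illustrative:

pip install 'toil[cwl]==4.0.0'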

0 reactions
adamnovak commented, Apr 16, 2020

Hooray! Glad to hear it.


