toil-cwl-runner on slurm fails to run jobs on multiple nodes
I have been trying to run a CWL workflow with toil and Slurm.
I do so by running an sbatch script containing toil-cwl-runner.
When the job runs entirely on one Slurm node it succeeds.
If Slurm chooses to run _toil_worker on a different node than the node running toil.leader, the _toil_worker job fails and the following is included in the toil-cwl-runner output:
MainThread WARNING toil.leader: No log file is present, despite job failing...
If you look at the failed job with sacct, the State is FAILED and the ExitCode is 1:0.
I have not been able to find any output logs on the nodes attempting to run _toil_worker.
I have been using toil from the master branch.
NOTE: If you specify --workDir with a path shared across all nodes, this problem does not occur.
I added some logging and found that the sbatch command issued by toil.leader is attempting to write stderr and stdout to files within a directory in /tmp that does not exist on the node trying to run _toil_worker:
sbatch
...
-o /tmp/toil-882d58aa-9bc5-4faa-abf5-4906fc65e8af-8c697123-fa43-4d9b-9948-405a15a8957d/toil_job_0_batch_slurm_%j_std_output.log
-e /tmp/toil-882d58aa-9bc5-4faa-abf5-4906fc65e8af-8c697123-fa43-4d9b-9948-405a15a8957d/toil_job_0_batch_slurm_%j_std_error.log
--wrap=_toil_worker...
This /tmp/toil-<uuid>/ directory does get created on the toil.leader node.
I think the following code specifies the file paths for the -o and -e arguments:
https://github.com/DataBiosphere/toil/blob/81a543d0e1a8e6c299f22bb8e862d34097e1f0bc/src/toil/batchSystems/slurm.py#L190-L191
Reproducing
This requires a slurm cluster and toil installed from the master branch (to get another bug fix).
Identify two nodes on which you wish to run toil.leader and _toil_worker, which I will refer to as c1-leader and c1-worker from now on.
Create an sbatch script toil.sbatch with the following contents:
#!/usr/bin/env bash
source env/bin/activate
export TOIL_SLURM_ARGS="-w c1-worker"
time toil-cwl-runner \
--jobStore ${HOME}/jobstore \
--batchSystem slurm \
--outdir results \
tar.cwl tar-job.yml
Follow the steps from https://www.commonwl.org/user_guide/04-output/index.html but instead of running
cwl-runner tar.cwl tar-job.yml
run
sbatch -w c1-leader toil.sbatch
This should run toil-cwl-runner on the c1-leader node and _toil_worker on the c1-worker node. You should see the job fail with "No log file is present, despite job failing." in toil-cwl-runner’s output.
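For comparison, the --workDir workaround from the NOTE above can be sketched as a variant of toil.sbatch. This is only a sketch under assumptions: the file name toil-shared.sbatch and the directory ${HOME}/toil-work are hypothetical, and it presumes ${HOME} lives on a filesystem mounted on every node (the reproduction already keeps the job store there). The snippet only writes the script; the submit command is left as a comment.

```shell
# Write a variant of toil.sbatch that points --workDir at a shared directory.
# The quoted 'EOF' keeps ${...} and the line continuations literal in the file.
cat > toil-shared.sbatch <<'EOF'
#!/usr/bin/env bash
source env/bin/activate
export TOIL_SLURM_ARGS="-w c1-worker"
# Assumption: ${HOME} is shared across all nodes; toil-work is a hypothetical name.
WORK_DIR="${HOME}/toil-work"
mkdir -p "${WORK_DIR}"
time toil-cwl-runner \
  --jobStore "${HOME}/jobstore" \
  --workDir "${WORK_DIR}" \
  --batchSystem slurm \
  --outdir results \
  tar.cwl tar-job.yml
EOF

# Submit it the same way as the original script:
# sbatch -w c1-leader toil-shared.sbatch
```

With the work directory on shared storage, the -o and -e log paths passed to sbatch should resolve on whichever node runs _toil_worker, which would match the behaviour described in the NOTE.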
┆Issue is synchronized with this Jira Task ┆Issue Number: TOIL-452
Top GitHub Comments
@adamnovak @DailyDreaming I retested the problem and it worked on toil 4.0. As far as I am concerned this problem is resolved.
Hooray! Glad to hear it.