WriteLogs fails with Slurm when node cannot find Toil
I stumbled upon this issue because I tried to run the CWL test suite, which basically means that there's no installed version of Toil other than the one in the virtual environment I'm using to run the tests.
How to reproduce
git clone https://github.com/DataBiosphere/toil.git
cd toil
python3 -m venv venv
unset PYTHONPATH
. venv/bin/activate
pip install -U pip wheel
pip install -e .[cwl] cwltest
git clone https://github.com/common-workflow-language/cwl-v1.2/
cd cwl-v1.2
mkdir -p tmp/{working,logs}
toil-cwl-runner --batchSystem=slurm --disableCaching --workDir=$PWD/tmp/working --writeLogs=$PWD/tmp/logs tests/bwa-mem-tool.cwl tests/bwa-mem-job.json
Notes:
- The last command is just the first test of the conformance test suite, executed manually, with slightly different command-line options.
- $PWD must be on a shared disk that is also visible on the compute node.
This command fails with the following error:
[2021-10-08T14:46:38+0200] [MainThread] [I] [cwltool] Resolved 'tests/bwa-mem-tool.cwl' to 'file:///home/rapthor-mloose/code/toil/cwl-v1.2/tests/bwa-mem-tool.cwl'
[2021-10-08T14:46:40+0200] [MainThread] [I] [toil.job] Saving graph of 1 jobs, 1 new
[2021-10-08T14:46:40+0200] [MainThread] [I] [toil.job] Processing job 'CWLJob' python kind-CWLJob/instance-74x417w1 v0
[2021-10-08T14:46:40+0200] [MainThread] [I] [toil] Running Toil version 5.6.0a1-eb2ae8365ae2ebdd50132570b20f7d480eb40cac on host ui-02.spider.surfsara.nl.
[2021-10-08T14:46:40+0200] [MainThread] [I] [toil.leader] Issued job 'CWLJob' python kind-CWLJob/instance-74x417w1 v1 with job batch system ID: 0 and cores: 2, disk: 2.0 Gi, and memory: 256.0 Mi
[2021-10-08T14:46:42+0200] [MainThread] [I] [toil.leader] 1 jobs are running, 0 jobs are issued and waiting to run
[2021-10-08T14:46:50+0200] [MainThread] [W] [toil.leader] Job failed with exit value 127: 'CWLJob' python kind-CWLJob/instance-74x417w1 v1
Exit reason: None
[2021-10-08T14:46:50+0200] [MainThread] [W] [toil.leader] No log file is present, despite job failing: 'CWLJob' python kind-CWLJob/instance-74x417w1 v1
[2021-10-08T14:46:50+0200] [MainThread] [W] [toil.leader] The batch system left an empty file /home/rapthor-mloose/code/toil/cwl-v1.2/tmp/working/toil_f1bc8bc0-0c7c-4730-896a-1a999672a46a.0.784924.out.log
[2021-10-08T14:46:50+0200] [MainThread] [W] [toil.leader] The batch system left a non-empty file /home/rapthor-mloose/code/toil/cwl-v1.2/tmp/working/toil_f1bc8bc0-0c7c-4730-896a-1a999672a46a.0.784924.err.log:
[2021-10-08T14:46:50+0200] [MainThread] [W] [toil.leader] Log from job "kind-CWLJob/instance-74x417w1" follows:
=========>
/var/spool/slurmd/job784924/slurm_script: line 4: _toil_worker: command not found
<=========
Workflow Progress 100%|██████████| 1/1 (0 failures) [00:10<00:00, 0.10 jobs/s]
Traceback (most recent call last):
File "/home/rapthor-mloose/code/toil/src/toil/cwl/cwltoil.py", line 3400, in main
outobj = toil.start(wf1)
File "/home/rapthor-mloose/code/toil/src/toil/common.py", line 844, in start
return self._runMainLoop(rootJobDescription)
File "/home/rapthor-mloose/code/toil/src/toil/common.py", line 1160, in _runMainLoop
jobCache=self._jobCache).run()
File "/home/rapthor-mloose/code/toil/src/toil/leader.py", line 248, in run
self.innerLoop()
File "/home/rapthor-mloose/code/toil/src/toil/leader.py", line 703, in innerLoop
self._gatherUpdatedJobs(updatedJobTuple)
File "/home/rapthor-mloose/code/toil/src/toil/leader.py", line 662, in _gatherUpdatedJobs
self.processFinishedJob(bsID, exitStatus, wallTime=wallTime, exitReason=exitReason)
File "/home/rapthor-mloose/code/toil/src/toil/leader.py", line 1137, in processFinishedJob
StatsAndLogging.writeLogFiles(jobNames, batchSystemFileStream, self.config, failed=True)
File "/home/rapthor-mloose/code/toil/src/toil/statsAndLogging.py", line 108, in writeLogFiles
mainFileName = jobNames[0]
IndexError: list index out of range
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/rapthor-mloose/code/toil/venv/bin/toil-cwl-runner", line 11, in <module>
load_entry_point('toil', 'console_scripts', 'toil-cwl-runner')()
File "/home/rapthor-mloose/code/toil/src/toil/cwl/cwltoil.py", line 3404, in main
if getattr(err, "exit_code") == CWL_UNSUPPORTED_REQUIREMENT_EXIT_CODE:
AttributeError: 'IndexError' object has no attribute 'exit_code'
Now, I'm not sure why the first error, _toil_worker: command not found, occurs, but that error is triggered on the compute node. The Toil runner on the head node then gets confused, because it doesn't receive from the compute node what it expects.
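The second traceback comes from the error handler itself: getattr(err, "exit_code") is called without a default, so any exception that lacks that attribute raises AttributeError and hides the original IndexError. A minimal sketch of a more defensive pattern follows. It is not Toil's actual fix; the helper names exit_code_of and first_job_name, the fallback name, and the constant's value are assumptions made for illustration.

```python
# Hypothetical stand-in for the constant referenced in cwltoil.py;
# the value used here is an assumption.
CWL_UNSUPPORTED_REQUIREMENT_EXIT_CODE = 33


def exit_code_of(err):
    """Return err.exit_code if the exception carries one, otherwise None."""
    return getattr(err, "exit_code", None)


def first_job_name(job_names, fallback="unknown-job"):
    """Pick the main log file name without assuming the list is non-empty."""
    return job_names[0] if job_names else fallback


if __name__ == "__main__":
    try:
        [][0]  # reproduces the IndexError seen in writeLogFiles
    except Exception as err:
        if exit_code_of(err) == CWL_UNSUPPORTED_REQUIREMENT_EXIT_CODE:
            print("unsupported requirement")
        else:
            print(f"unexpected failure: {err!r}")
```

With a default passed to getattr, the handler falls through to the generic branch and the original exception stays visible instead of being masked.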
Issue is synchronized with this Jira Task
Issue Number: TOIL-1049
Top GitHub Comments
TOIL_SLURM_ARGS is not set, and sbatch is not wrapped.
That was my understanding as well. However, when I look at the venv/bin/activate script that is generated by the command python3 -m venv venv, I see no PYTHONPATH being set. So, I'm a bit confused how they play that trick. My guess is that they derive the location of site-packages from the location of the python binary. But I could be wrong.
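As far as I understand it, that guess is close: the activate script only prepends venv/bin to PATH, and the interpreter finds its site-packages through a pyvenv.cfg file located one directory above its own executable, so PYTHONPATH is not involved. It also suggests that a compute node that does not inherit the modified PATH would not find _toil_worker. Here is a small sketch that could be run on both the head node and a compute node to compare; the function name describe_environment is made up for this example.

```python
import os
import shutil
import site
import sys


def describe_environment(worker="_toil_worker"):
    """Collect the settings that decide where Python and the worker are found."""
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,            # the venv directory when inside a venv
        "base_prefix": sys.base_prefix,  # the interpreter the venv was created from
        "in_venv": sys.prefix != sys.base_prefix,
        "site_packages": site.getsitepackages(),
        "PYTHONPATH": os.environ.get("PYTHONPATH"),
        "worker_path": shutil.which(worker),  # None if not on PATH
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If worker_path is None on the compute node while it resolves on the head node, the PATH from the activated environment is probably not reaching the Slurm job.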