WriteLogs fails with Slurm when node cannot find Toil

I stumbled upon this issue while trying to run the CWL test suite. In that setup there is no installed version of Toil other than the one in the virtual environment I'm using to run the tests.

How to reproduce

git clone https://github.com/DataBiosphere/toil.git
cd toil
python3 -m venv venv
unset PYTHONPATH
. venv/bin/activate
pip install -U pip wheel
pip install -e .[cwl] cwltest
git clone https://github.com/common-workflow-language/cwl-v1.2/
cd cwl-v1.2
mkdir -p tmp/{working,logs}
toil-cwl-runner --batchSystem=slurm --disableCaching --workDir=$PWD/tmp/working --writeLogs=$PWD/tmp/logs tests/bwa-mem-tool.cwl tests/bwa-mem-job.json

Notes:

  1. The last command is just the first test of the conformance test suite, executed manually, with slightly different command-line options.
  2. $PWD must be on a shared disk that is also visible on the compute node.

This command fails with the following error:

[2021-10-08T14:46:38+0200] [MainThread] [I] [cwltool] Resolved 'tests/bwa-mem-tool.cwl' to 'file:///home/rapthor-mloose/code/toil/cwl-v1.2/tests/bwa-mem-tool.cwl'
[2021-10-08T14:46:40+0200] [MainThread] [I] [toil.job] Saving graph of 1 jobs, 1 new
[2021-10-08T14:46:40+0200] [MainThread] [I] [toil.job] Processing job 'CWLJob' python kind-CWLJob/instance-74x417w1 v0
[2021-10-08T14:46:40+0200] [MainThread] [I] [toil] Running Toil version 5.6.0a1-eb2ae8365ae2ebdd50132570b20f7d480eb40cac on host ui-02.spider.surfsara.nl.
[2021-10-08T14:46:40+0200] [MainThread] [I] [toil.leader] Issued job 'CWLJob' python kind-CWLJob/instance-74x417w1 v1 with job batch system ID: 0 and cores: 2, disk: 2.0 Gi, and memory: 256.0 Mi
[2021-10-08T14:46:42+0200] [MainThread] [I] [toil.leader] 1 jobs are running, 0 jobs are issued and waiting to run
[2021-10-08T14:46:50+0200] [MainThread] [W] [toil.leader] Job failed with exit value 127: 'CWLJob' python kind-CWLJob/instance-74x417w1 v1
Exit reason: None
[2021-10-08T14:46:50+0200] [MainThread] [W] [toil.leader] No log file is present, despite job failing: 'CWLJob' python kind-CWLJob/instance-74x417w1 v1
[2021-10-08T14:46:50+0200] [MainThread] [W] [toil.leader] The batch system left an empty file /home/rapthor-mloose/code/toil/cwl-v1.2/tmp/working/toil_f1bc8bc0-0c7c-4730-896a-1a999672a46a.0.784924.out.log
[2021-10-08T14:46:50+0200] [MainThread] [W] [toil.leader] The batch system left a non-empty file /home/rapthor-mloose/code/toil/cwl-v1.2/tmp/working/toil_f1bc8bc0-0c7c-4730-896a-1a999672a46a.0.784924.err.log:
[2021-10-08T14:46:50+0200] [MainThread] [W] [toil.leader] Log from job "kind-CWLJob/instance-74x417w1" follows:
=========>
	/var/spool/slurmd/job784924/slurm_script: line 4: _toil_worker: command not found
<=========

Workflow Progress 100%|████████████████████████████████████████████████████████████| 1/1 (0 failures) [00:10<00:00, 0.10 jobs/s]
Traceback (most recent call last):
  File "/home/rapthor-mloose/code/toil/src/toil/cwl/cwltoil.py", line 3400, in main
    outobj = toil.start(wf1)
  File "/home/rapthor-mloose/code/toil/src/toil/common.py", line 844, in start
    return self._runMainLoop(rootJobDescription)
  File "/home/rapthor-mloose/code/toil/src/toil/common.py", line 1160, in _runMainLoop
    jobCache=self._jobCache).run()
  File "/home/rapthor-mloose/code/toil/src/toil/leader.py", line 248, in run
    self.innerLoop()
  File "/home/rapthor-mloose/code/toil/src/toil/leader.py", line 703, in innerLoop
    self._gatherUpdatedJobs(updatedJobTuple)
  File "/home/rapthor-mloose/code/toil/src/toil/leader.py", line 662, in _gatherUpdatedJobs
    self.processFinishedJob(bsID, exitStatus, wallTime=wallTime, exitReason=exitReason)
  File "/home/rapthor-mloose/code/toil/src/toil/leader.py", line 1137, in processFinishedJob
    StatsAndLogging.writeLogFiles(jobNames, batchSystemFileStream, self.config, failed=True)
  File "/home/rapthor-mloose/code/toil/src/toil/statsAndLogging.py", line 108, in writeLogFiles
    mainFileName = jobNames[0]
IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/rapthor-mloose/code/toil/venv/bin/toil-cwl-runner", line 11, in <module>
    load_entry_point('toil', 'console_scripts', 'toil-cwl-runner')()
  File "/home/rapthor-mloose/code/toil/src/toil/cwl/cwltoil.py", line 3404, in main
    if getattr(err, "exit_code") == CWL_UNSUPPORTED_REQUIREMENT_EXIT_CODE:
AttributeError: 'IndexError' object has no attribute 'exit_code'

Now, I’m not sure why the first error _toil_worker: command not found occurs, but that error is triggered on the compute node. The Toil runner on the head node then gets confused, because it doesn’t receive from the compute node what it expects.
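The two chained exceptions in the traceback can be reproduced in isolation. The following is a minimal sketch, not Toil's actual code or fix: the helper names main_log_name and is_unsupported_requirement and the constant value are hypothetical, chosen only to illustrate the pattern. It shows that indexing an empty list raises IndexError, and that two-argument getattr raises AttributeError when the attribute is missing, whereas the three-argument form returns a default.

```python
from typing import List, Optional

# Placeholder value for illustration only; it is not taken from the traceback.
UNSUPPORTED_EXIT_CODE = 33


def main_log_name(job_names: List[str]) -> Optional[str]:
    """Return the first job name, or None when the list is empty."""
    # job_names[0] on an empty list raises IndexError, which matches the
    # first exception in the traceback above.
    return job_names[0] if job_names else None


def is_unsupported_requirement(err: BaseException) -> bool:
    """Compare an exception's exit_code without assuming the attribute exists."""
    # getattr(err, "exit_code") raises AttributeError for exceptions that
    # lack the attribute; passing a default avoids the second exception.
    return getattr(err, "exit_code", None) == UNSUPPORTED_EXIT_CODE


if __name__ == "__main__":
    print(main_log_name([]))
    try:
        [][0]
    except IndexError as err:
        print(is_unsupported_requirement(err))
```

With guards like these, the leader would still have to report the failed job, but the original "command not found" message would not be buried under two unrelated tracebacks.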

┆Issue is synchronized with this Jira Task ┆Issue Number: TOIL-1049

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

1 reaction
gmloose commented, Oct 8, 2021

Is TOIL_SLURM_ARGS set, and what is its value?

Is there an sbatch wrapper script installed that adds an --export= option? https://slurm.schedmd.com/sbatch.html#OPT_export

TOIL_SLURM_ARGS is not set, and sbatch is not wrapped.

$ file `which sbatch`
/usr/bin/sbatch: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, BuildID[sha1]=d9aebdcf7df3909dcbac458272298eb186a27f58, with debug_info, not stripped
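
The same checks can be scripted so that the output from the head node and from inside a Slurm job can be compared side by side. This is only a sketch under assumptions: the function name submission_environment_report is made up for illustration, and it reports nothing beyond the TOIL_SLURM_ARGS variable and the locations of sbatch and _toil_worker on the current PATH.

```python
import os
import shutil
from typing import Dict, Optional, Sequence


def submission_environment_report(
    commands: Sequence[str] = ("sbatch", "_toil_worker"),
) -> Dict[str, Optional[str]]:
    """Report TOIL_SLURM_ARGS and where each command resolves on PATH."""
    # None means the variable is unset in this process's environment.
    report: Dict[str, Optional[str]] = {
        "TOIL_SLURM_ARGS": os.environ.get("TOIL_SLURM_ARGS"),
    }
    for command in commands:
        # shutil.which returns the full path, or None if the command
        # cannot be found on the current PATH.
        report[command] = shutil.which(command)
    return report


if __name__ == "__main__":
    for name, value in submission_environment_report().items():
        print(f"{name}: {value}")
```

If _toil_worker resolves on the head node but shows None when the script runs inside a job, the PATH seen by the job differs from the one at submission time.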
0 reactions
gmloose commented, Oct 8, 2021

Virtualenvs work by setting environment variables like PATH and PYTHONPATH. sbatch and other batch schedulers automatically forward all environment variables set at job submission time.

That was my understanding as well. However, when I look at the venv/bin/activate script that is generated by the command python3 -m venv venv, I see no PYTHONPATH being set. So, I’m a bit confused how they play that trick. My guess is that they derive the location of site-packages from the location of the python binary. But I could be wrong.
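
That guess can be checked from inside the interpreter. In a venv, Python finds a pyvenv.cfg file next to (or one directory above) its executable and sets sys.prefix to the environment directory, from which the site-packages location is derived; PYTHONPATH is not involved. The sketch below only prints what the running interpreter reports, and the function name describe_interpreter is made up for illustration.

```python
import os
import site
import sys
from typing import Any, Dict


def describe_interpreter() -> Dict[str, Any]:
    """Collect the values that show how site-packages is located."""
    return {
        "executable": sys.executable,
        # In a venv, prefix points at the environment and base_prefix at
        # the Python installation the environment was created from.
        "prefix": sys.prefix,
        "base_prefix": sys.base_prefix,
        "in_venv": sys.prefix != sys.base_prefix,
        "site_packages": site.getsitepackages(),
        # Typically None inside an activated venv: activate only adjusts
        # PATH and related variables, not PYTHONPATH.
        "PYTHONPATH": os.environ.get("PYTHONPATH"),
    }


if __name__ == "__main__":
    for key, value in describe_interpreter().items():
        print(f"{key}: {value}")
```

Since the environment is selected through the python binary found on PATH, a compute node that does not inherit the activated PATH would also not find console scripts such as _toil_worker in venv/bin.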
