Toil fails to properly handle spurious Slurm database connection time-outs
On our cluster, I see spurious Slurm database connection time-outs when running a Toil job. Here's an example of the exception that occurs:
sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:batch-01:6819: Connection timed out
sacct: error: Sending PersistInit msg: Connection timed out
sacct: error: Problem talking to the database: Connection timed out
[2021-09-03T02:11:29+0200] [Thread-166] [E] [toil.batchSystems.abstractGridEngineBatchSystem] GridEngine like batch system failure
Traceback (most recent call last):
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/batchSystems/slurm.py", line 79, in getJobExitCode
    state, rc = self._getJobDetailsFromSacct(slurmJobID)
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/batchSystems/slurm.py", line 100, in _getJobDetailsFromSacct
    stdout = call_command(args)
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/lib/misc.py", line 67, in call_command
    raise CalledProcessErrorStderr(proc.returncode, cmd, output=stdout, stderr=stderr)
toil.lib.misc.CalledProcessErrorStderr: Command '['sacct', '-n', '-j', '682747', '--format', 'State,ExitCode', '-P', '-S', '1970-01-01']' exit status 1: sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:batch-01:6819: Connection timed out
sacct: error: Sending PersistInit msg: Connection timed out
sacct: error: Problem talking to the database: Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/batchSystems/abstractGridEngineBatchSystem.py", line 252, in run
    while self._runStep():
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/batchSystems/abstractGridEngineBatchSystem.py", line 242, in _runStep
    activity |= self.checkOnJobs()
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/batchSystems/abstractGridEngineBatchSystem.py", line 200, in checkOnJobs
    status = self.boss.with_retries(self.getJobExitCode, batch_job_id)
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/batchSystems/abstractGridEngineBatchSystem.py", line 473, in with_retries
    return operation(*args, **kwargs)
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/batchSystems/slurm.py", line 82, in getJobExitCode
    state, rc = self._getJobDetailsFromScontrol(slurmJobID)
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/batchSystems/slurm.py", line 145, in _getJobDetailsFromScontrol
    job[bits[0]] = bits[1]
IndexError: list index out of range

Exception in thread Thread-166:
Traceback (most recent call last):
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/batchSystems/slurm.py", line 79, in getJobExitCode
    state, rc = self._getJobDetailsFromSacct(slurmJobID)
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/batchSystems/slurm.py", line 100, in _getJobDetailsFromSacct
    stdout = call_command(args)
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/lib/misc.py", line 67, in call_command
    raise CalledProcessErrorStderr(proc.returncode, cmd, output=stdout, stderr=stderr)
toil.lib.misc.CalledProcessErrorStderr: Command '['sacct', '-n', '-j', '682747', '--format', 'State,ExitCode', '-P', '-S', '1970-01-01']' exit status 1: sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:batch-01:6819: Connection timed out
sacct: error: Sending PersistInit msg: Connection timed out
sacct: error: Problem talking to the database: Connection timed out
When this happens, Toil doesn't retry the failed sacct call: as the traceback shows, it falls back to scontrol (slurm.py, line 82), whose output parsing then raises the IndexError, and Toil loses track of the state of the running job. As a result, the workflow hangs, issuing a message like this every hour:
[2021-09-03T02:59:49+0200] [MainThread] [I] [toil.leader] 0 jobs are running, 697 jobs are issued and waiting to run
However, the job it is waiting for has by now finished, which can be verified manually:
$ sacct -n -j 682747 --format State,ExitCode -P -S 1970-01-01
COMPLETED|0:0
COMPLETED|0:0
COMPLETED|0:0
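Since the time-outs are transient, retrying the sacct call a few times before giving up should be enough to ride them out. A minimal sketch of what I mean (the function name and retry parameters here are mine for illustration, not Toil's actual code, which goes through call_command() in toil/lib/misc.py):

import subprocess
import time

def sacct_exit_info(job_id, attempts=3, delay=10.0):
    """Run sacct for one job, retrying transient slurmdbd time-outs.

    Illustrative sketch only, not Toil's implementation.
    """
    cmd = ['sacct', '-n', '-j', str(job_id),
           '--format', 'State,ExitCode', '-P', '-S', '1970-01-01']
    for attempt in range(1, attempts + 1):
        proc = subprocess.run(cmd, stdout=subprocess.PIPE,
                              stderr=subprocess.PIPE, universal_newlines=True)
        if proc.returncode == 0:
            return proc.stdout
        # 'Connection timed out' from slurmdbd is transient: back off and retry.
        if attempt < attempts and 'Connection timed out' in proc.stderr:
            time.sleep(delay)
            continue
        raise RuntimeError('sacct failed after %d attempt(s): %s'
                           % (attempt, proc.stderr.strip()))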
Issue is synchronized with Jira task TOIL-1005.
Top GitHub Comments
I think this issue can be closed. The actual cause of the error was not a missing retry: the call to getJobExitCode(), which calls _getJobDetailsFromSacct(), is already wrapped in a retry: https://github.com/DataBiosphere/toil/blob/1393068bd77f99aeae9faa9a10e20b54396b13e0/src/toil/batchSystems/abstractGridEngineBatchSystem.py#L164 The real cause was the IndexError that occurred in _getJobDetailsFromScontrol(), which has been fixed in my PR.

The original exception message got me puzzled. Why does it fail on an IndexError? This means that the scontrol command must have succeeded; otherwise you would have expected a CalledProcessErrorStderr (again). So I figured out that the parsing of the scontrol output is broken. I will try to fix this and create a PR for it.
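To make the failure mode concrete: the crashing line, job[bits[0]] = bits[1], assigns both halves of each KEY=VALUE pair, so any token of the scontrol output that contains no '=' at all yields a one-element list and an IndexError. A tolerant parser simply skips such tokens; a minimal sketch (the function name is mine, and this is not the exact code from the PR):

def parse_scontrol_output(stdout):
    """Parse 'scontrol show job <id>' output into a dict of job attributes.

    Sketch only: tokens without an '=' (e.g. fragments of values that
    themselves contain spaces) are skipped instead of crashing.
    """
    job = {}
    for token in stdout.split():
        bits = token.split('=', 1)
        if len(bits) == 2:  # keep only well-formed KEY=VALUE pairs
            job[bits[0]] = bits[1]
    return job

For example, parse_scontrol_output('JobId=682747 JobName=my_job Partition=batch') returns {'JobId': '682747', 'JobName': 'my_job', 'Partition': 'batch'}.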