Toil fails to properly handle spurious Slurm database connection time-outs
On our cluster, I see spurious Slurm database connection time-outs when running a Toil job. Here's an example of the exception that occurs:
sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:batch-01:6819: Connection timed out
sacct: error: Sending PersistInit msg: Connection timed out
sacct: error: Problem talking to the database: Connection timed out
[2021-09-03T02:11:29+0200] [Thread-166] [E] [toil.batchSystems.abstractGridEngineBatchSystem] GridEngine like batch system failure
Traceback (most recent call last):
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/batchSystems/slurm.py", line 79, in getJobExitCode
    state, rc = self._getJobDetailsFromSacct(slurmJobID)
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/batchSystems/slurm.py", line 100, in _getJobDetailsFromSacct
    stdout = call_command(args)
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/lib/misc.py", line 67, in call_command
    raise CalledProcessErrorStderr(proc.returncode, cmd, output=stdout, stderr=stderr)
toil.lib.misc.CalledProcessErrorStderr: Command '['sacct', '-n', '-j', '682747', '--format', 'State,ExitCode', '-P', '-S', '1970-01-01']' exit status 1: sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:batch-01:6819: Connection timed out
sacct: error: Sending PersistInit msg: Connection timed out
sacct: error: Problem talking to the database: Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/batchSystems/abstractGridEngineBatchSystem.py", line 252, in run
    while self._runStep():
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/batchSystems/abstractGridEngineBatchSystem.py", line 242, in _runStep
    activity |= self.checkOnJobs()
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/batchSystems/abstractGridEngineBatchSystem.py", line 200, in checkOnJobs
    status = self.boss.with_retries(self.getJobExitCode, batch_job_id)
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/batchSystems/abstractGridEngineBatchSystem.py", line 473, in with_retries
    return operation(*args, **kwargs)
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/batchSystems/slurm.py", line 82, in getJobExitCode
    state, rc = self._getJobDetailsFromScontrol(slurmJobID)
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/batchSystems/slurm.py", line 145, in _getJobDetailsFromScontrol
    job[bits[0]] = bits[1]
IndexError: list index out of range

Exception in thread Thread-166:
Traceback (most recent call last):
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/batchSystems/slurm.py", line 79, in getJobExitCode
    state, rc = self._getJobDetailsFromSacct(slurmJobID)
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/batchSystems/slurm.py", line 100, in _getJobDetailsFromSacct
    stdout = call_command(args)
  File "/project/rapthor/Software/rapthor/lib/python3.6/site-packages/toil/lib/misc.py", line 67, in call_command
    raise CalledProcessErrorStderr(proc.returncode, cmd, output=stdout, stderr=stderr)
toil.lib.misc.CalledProcessErrorStderr: Command '['sacct', '-n', '-j', '682747', '--format', 'State,ExitCode', '-P', '-S', '1970-01-01']' exit status 1: sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:batch-01:6819: Connection timed out
sacct: error: Sending PersistInit msg: Connection timed out
sacct: error: Problem talking to the database: Connection timed out
When this happens, Toil doesn't retry the failed sacct call: as the traceback shows, it falls back to scontrol (slurm.py, line 82), whose output parsing then raises the IndexError, and Toil loses track of the state of the running job. As a result, the workflow hangs, issuing a message like this every hour:
[2021-09-03T02:59:49+0200] [MainThread] [I] [toil.leader] 0 jobs are running, 697 jobs are issued and waiting to run
However, the job it is waiting for has by now finished, which can be verified manually:
$ sacct -n -j 682747 --format State,ExitCode -P -S 1970-01-01
COMPLETED|0:0
COMPLETED|0:0
COMPLETED|0:0
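Since the time-outs are transient, retrying the sacct call a few times before giving up should be enough to ride them out. A minimal sketch of what I mean (the function name and retry parameters here are mine for illustration, not Toil's actual code, which goes through call_command() in toil/lib/misc.py):

import subprocess
import time

def sacct_exit_info(job_id, attempts=3, delay=10.0):
    """Run sacct for one job, retrying transient slurmdbd time-outs.

    Illustrative sketch only, not Toil's implementation.
    """
    cmd = ['sacct', '-n', '-j', str(job_id),
           '--format', 'State,ExitCode', '-P', '-S', '1970-01-01']
    for attempt in range(1, attempts + 1):
        proc = subprocess.run(cmd, stdout=subprocess.PIPE,
                              stderr=subprocess.PIPE, universal_newlines=True)
        if proc.returncode == 0:
            return proc.stdout
        # 'Connection timed out' from slurmdbd is transient: back off and retry.
        if attempt < attempts and 'Connection timed out' in proc.stderr:
            time.sleep(delay)
            continue
        raise RuntimeError('sacct failed after %d attempt(s): %s'
                           % (attempt, proc.stderr.strip()))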
Issue is synchronized with Jira task TOIL-1005.
Top GitHub Comments
I think this issue can be closed. The actual cause of the error was not a missing retry: the call to getJobExitCode(), which calls _getJobDetailsFromSacct(), is already wrapped in a retry: https://github.com/DataBiosphere/toil/blob/1393068bd77f99aeae9faa9a10e20b54396b13e0/src/toil/batchSystems/abstractGridEngineBatchSystem.py#L164 The real cause was the IndexError that occurred in _getJobDetailsFromScontrol(), which has been fixed in my PR.

The original exception message got me puzzled. Why does it fail on an IndexError? This means that the scontrol command must have succeeded; otherwise you would have expected a CalledProcessErrorStderr (again). So I figured out that the parsing of the scontrol output is broken. I will try to fix this and create a PR for it.
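To make the failure mode concrete: the crashing line, job[bits[0]] = bits[1], assigns both halves of each KEY=VALUE pair, so any token of the scontrol output that contains no '=' at all yields a one-element list and an IndexError. A tolerant parser simply skips such tokens; a minimal sketch (the function name is mine, and this is not the exact code from the PR):

def parse_scontrol_output(stdout):
    """Parse 'scontrol show job <id>' output into a dict of job attributes.

    Sketch only: tokens without an '=' (e.g. fragments of values that
    themselves contain spaces) are skipped instead of crashing.
    """
    job = {}
    for token in stdout.split():
        bits = token.split('=', 1)
        if len(bits) == 2:  # keep only well-formed KEY=VALUE pairs
            job[bits[0]] = bits[1]
    return job

For example, parse_scontrol_output('JobId=682747 JobName=my_job Partition=batch') returns {'JobId': '682747', 'JobName': 'my_job', 'Partition': 'batch'}.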