
SLURM cluster fails with unrecognized option '--parsable'

See original GitHub issue

I’m trying out a Pangeo deployment on our local HPC system at UAlbany, which uses SLURM. I’m basically following these instructions from the Pangeo documentation.

Running this code from within a Jupyter notebook:

from dask_jobqueue import SLURMCluster
cluster = SLURMCluster(processes=4,
                       cores=4,
                       memory="16GB",
                       walltime="01:00:00",
                       queue="snow-1")
cluster.scale(4)

fails with the following:

tornado.application - ERROR - Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x2b8d1bfb7bf8>, 4)
Traceback (most recent call last):
  File "/network/rit/home/br546577/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/ioloop.py", line 758, in _run_callback
    ret = callback()
  File "/network/rit/home/br546577/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/stack_context.py", line 300, in null_wrapper
    return fn(*args, **kwargs)
  File "/network/rit/home/br546577/miniconda3/envs/pangeo/lib/python3.6/site-packages/dask_jobqueue/core.py", line 416, in scale_up
    self.start_workers(n - self._count_active_and_pending_workers())
  File "/network/rit/home/br546577/miniconda3/envs/pangeo/lib/python3.6/site-packages/dask_jobqueue/core.py", line 330, in start_workers
    out = self._submit_job(fn)
  File "/network/rit/home/br546577/miniconda3/envs/pangeo/lib/python3.6/site-packages/dask_jobqueue/core.py", line 322, in _submit_job
    return self._call(shlex.split(self.submit_command) + [script_filename])
  File "/network/rit/home/br546577/miniconda3/envs/pangeo/lib/python3.6/site-packages/dask_jobqueue/core.py", line 383, in _call
    cmd_str, out, err))
RuntimeError: Command exited with non-zero exit code.
Exit code: 1
Command:
sbatch --parsable /tmp/tmpo8zdikq3.sh
stdout:

stderr:
sbatch: unrecognized option '--parsable'
Try "sbatch --help" for more information

I generated the same error message by running

sbatch --parsable

directly on the command line.

It’s possible that this is because we are running a very old version of SLURM:

[br546577@snow-23 ~]$ sbatch --version
slurm 2.5.1

Workarounds?

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 19 (12 by maintainers)

Top GitHub Comments

1 reaction
jhamman commented, Jan 30, 2019

@brian-rose - that actually looks right.

The reason the workers aren’t starting appears to be unrelated to your original issue. From here, I suggest working through https://dask-jobqueue.readthedocs.io/en/latest/debug.html. In particular, the cluster.job_script() method seems to be very useful for understanding what jobqueue is doing and how it interfaces with your scheduler.
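
For example, here is a minimal sketch of that check, reusing the parameters from the original report (adjust them for your own cluster):

from dask_jobqueue import SLURMCluster

# Constructing the cluster object submits nothing to SLURM yet,
# so this is safe to run even while job submission is broken.
cluster = SLURMCluster(processes=4,
                       cores=4,
                       memory="16GB",
                       walltime="01:00:00",
                       queue="snow-1")

# Print the batch script dask-jobqueue generates; submitting that file
# by hand with sbatch shows the scheduler's error messages directly.
print(cluster.job_script())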

1 reaction
lesteve commented, Jan 29, 2019

It’s possible that this is because we are running a very old version of SLURM:

I googled a bit and found this commit, which seems to have first landed in the 14-03-0-1 release, released on Mar 26, 2014.

The first thing I would suggest is asking your sysadmin whether there is any chance of updating the SLURM install. It may be unlikely, but I guess it’s worth a try.

A work-around in dask-jobqueue would be to not use --parsable and instead get the job ID from the stdout produced by sbatch the_temporary_script.sh with a regex, as sketched below. A PR doing that would be more than welcome!
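
As a rough illustration (the helper name submit_and_get_job_id is made up here, and it assumes sbatch prints its usual "Submitted batch job <id>" line on success):

import re
import subprocess

def submit_and_get_job_id(script_filename):
    # Submit without --parsable and recover the job ID from sbatch's
    # human-readable output, e.g. "Submitted batch job 123456".
    out = subprocess.check_output(["sbatch", script_filename]).decode()
    match = re.search(r"Submitted batch job (\d+)", out)
    if match is None:
        raise RuntimeError("Could not parse job ID from sbatch output: %r" % out)
    return match.group(1)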

You may want to look at https://github.com/dask/dask-jobqueue/pull/45, which added --parsable in dask-jobqueue, and the reasons that motivated the change. IIRC the main reason was that it is cleaner to avoid post-processing in dask-jobqueue as much as possible, but of course we did not imagine that --parsable would be a problem on very old SLURM installs …

Another option is to draw inspiration from IPython.parallel and how they get the job ID from the submit command output in https://github.com/ipython/ipyparallel/blob/6.1.1/ipyparallel/apps/launcher.py.


Top Results From Across the Web

  • [slurm-users] Issue with AccountingStoreFlags after SLURM ...
    After I removed the invalid AccountingStoreFlags option and restarted the SLURM daemons on all nodes the jobs got rescheduled, but now all nodes ...
  • sbatch - Slurm Workload Manager
    Slurm will attempt to submit a sibling job to a cluster if it has at least one of the specified features. If the ...
  • [slurm-dev] "unrecognized key: OverSubscribe" for partition
    Hello, I am running a small cluster, and recently we wanted to enable the OverSubscribe option for the default partition in order to ...
  • Setting up slurm.conf file for single computer - Stack Overflow
    Hi I am attempting to utilize a processing pipeline which is written to run on multiple computer clusters using slurm however I ...
  • Everything You Need to Know about Using Slurm on Quest
    sbatch --parsable <name_of_script> 549005. If there is an error in your job submission script, the job will not be accepted by the scheduler ...
