SLURM cluster fails with unrecognized option '--parsable'
I’m trying out a Pangeo deployment on our local HPC system at UAlbany, which uses SLURM, basically following these instructions from the Pangeo documentation.
Running this code from within a Jupyter notebook:
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(processes=4,
                       cores=4,
                       memory="16GB",
                       walltime="01:00:00",
                       queue="snow-1")
cluster.scale(4)
fails with the following:
tornado.application - ERROR - Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x2b8d1bfb7bf8>, 4)
Traceback (most recent call last):
  File "/network/rit/home/br546577/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/ioloop.py", line 758, in _run_callback
    ret = callback()
  File "/network/rit/home/br546577/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/stack_context.py", line 300, in null_wrapper
    return fn(*args, **kwargs)
  File "/network/rit/home/br546577/miniconda3/envs/pangeo/lib/python3.6/site-packages/dask_jobqueue/core.py", line 416, in scale_up
    self.start_workers(n - self._count_active_and_pending_workers())
  File "/network/rit/home/br546577/miniconda3/envs/pangeo/lib/python3.6/site-packages/dask_jobqueue/core.py", line 330, in start_workers
    out = self._submit_job(fn)
  File "/network/rit/home/br546577/miniconda3/envs/pangeo/lib/python3.6/site-packages/dask_jobqueue/core.py", line 322, in _submit_job
    return self._call(shlex.split(self.submit_command) + [script_filename])
  File "/network/rit/home/br546577/miniconda3/envs/pangeo/lib/python3.6/site-packages/dask_jobqueue/core.py", line 383, in _call
    cmd_str, out, err))
RuntimeError: Command exited with non-zero exit code.
Exit code: 1
Command:
sbatch --parsable /tmp/tmpo8zdikq3.sh
stdout:
stderr:
sbatch: unrecognized option '--parsable'
Try "sbatch --help" for more information
I generated the same error message by running
sbatch --parsable
directly on the command line.
It’s possible that this is because we are running a very old version of SLURM:
[br546577@snow-23 ~]$ sbatch --version
slurm 2.5.1
Workarounds?
Top GitHub Comments
@brian-rose - that actually looks right.
Why the workers aren’t starting appears to be unrelated to your original issue. From here, I suggest working through https://dask-jobqueue.readthedocs.io/en/latest/debug.html. In particular, the cluster.job_script() method is very useful for understanding what jobqueue is doing and how it is interfacing with your scheduler.
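For example (a minimal sketch; the exact output depends on your dask-jobqueue version and configuration):

# Inspect the SLURM batch script that dask-jobqueue will hand to sbatch,
# including the #SBATCH directives and the dask-worker command.
print(cluster.job_script())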
I googled a bit and found this commit, which seems to have first shipped in the 14-03-0-1 release, released on Mar 26, 2014.
The first thing I would suggest is asking your sysadmin whether there is any chance of updating the SLURM install. It may be unlikely, but I guess it’s worth a try.
A work-around in dask-jobqueue would be to not use --parsable and instead extract the job id from the stdout produced by sbatch the_temporary_script.sh with a regex, along the lines of the sketch below. A PR doing that would be more than welcome!
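A rough sketch of that idea (a hypothetical helper, not the actual dask-jobqueue code; it assumes plain sbatch prints something like "Submitted batch job 123456" on stdout):

import re
import subprocess

# Regex for the human-readable confirmation line that plain sbatch prints,
# e.g. "Submitted batch job 123456".
JOB_ID_RE = re.compile(r"Submitted batch job (\d+)")

def submit_and_get_job_id(script_filename):
    # Submit without --parsable and recover the job id from stdout.
    out = subprocess.check_output(["sbatch", script_filename],
                                  universal_newlines=True)
    match = JOB_ID_RE.search(out)
    if match is None:
        raise RuntimeError("Could not parse job id from sbatch output: %r" % out)
    return match.group(1)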
You may want to look at https://github.com/dask/dask-jobqueue/pull/45, which added --parsable in dask-jobqueue, and the reasons that motivated the change. IIRC the main reason was that it is cleaner to avoid post-processing in dask-jobqueue as much as possible, but of course we did not imagine that --parsable would be a problem on very old SLURM installs …
Another thing to draw inspiration from: IPython.parallel and how they get the job id from the submit command output in https://github.com/ipython/ipyparallel/blob/6.1.1/ipyparallel/apps/launcher.py.