Add RemoteSlurmJob to connect SLURMCluster to a remote Slurm cluster
Hello 👋 Thank you for considering this feature request 😃 I have been looking into dask-jobqueue (together with Prefect) to allocate resources on a Slurm cluster I have access to. dask-jobqueue seems to be exactly what we’d need for this, thank you for maintaining it 🙇
Context
When Slurm and the Python process (script or notebook) are not running on the same host, SLURMCluster cannot spawn any jobs and fails with an error.
Slurm added a REST API: https://slurm.schedmd.com/rest_api.html
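To make the idea concrete, here is a minimal sketch of what a job submission over that REST API might look like, using only the standard library. The base URL, the `v0.0.36` version segment, the payload layout, and the helper name `build_submit_request` are assumptions for illustration and should be checked against the REST API reference for the slurmrestd version in use. The request is only built here, not sent.

```python
import json
import urllib.request

# Assumption: version segment of the slurmrestd endpoints; adjust as needed.
API_PREFIX = "/slurm/v0.0.36"


def build_submit_request(base_url, script, job_name, user, token, cwd="/tmp"):
    """Build (but do not send) a POST request that submits a batch script."""
    payload = {
        "script": script,
        "job": {
            "name": job_name,
            "current_working_directory": cwd,
            # Assumption: the API wants an explicit environment for the job.
            "environment": {"PATH": "/usr/bin:/bin"},
        },
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + API_PREFIX + "/job/submit",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Token-based authentication headers used by slurmrestd.
            "X-SLURM-USER-NAME": user,
            "X-SLURM-USER-TOKEN": token,
        },
    )
```

Sending the request would then be a matter of passing it to `urllib.request.urlopen` and decoding the JSON body of the response.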
Feature
Could we add a RemoteSLURMCluster and a RemoteSLURMJob that largely extend SLURMCluster/SLURMJob but make an HTTP request instead of using subprocess.Popen?
from contextlib import contextmanager

import requests
from dask_jobqueue import SLURMCluster
from dask_jobqueue.slurm import SLURMJob


class RemoteSLURMJob(SLURMJob):
    @contextmanager
    def job_file(self):
        # We don't need the script file, only the script itself. To avoid
        # altering `async def start(self):`, we yield the script here.
        yield self.job_script()

    async def _submit_job(self, script):
        # Formatting should be according to:
        # https://slurm.schedmd.com/rest_api.html#slurmctldSubmitJob
        # 'slurm-url' is a placeholder for the REST API base URL.
        return requests.post('slurm-url/jobs/submit').json()  # reach out to API

    def _job_id_from_submit_output(self, out):
        # out is the JSON output from the requests.post call in _submit_job.
        # See https://slurm.schedmd.com/rest_api.html#v0.0.36_job_submission_response
        return out['job_id']

    @classmethod
    def _close_job(cls, job_id):
        # See: https://slurm.schedmd.com/rest_api.html#slurmctldCancelJob
        requests.delete(f'slurm-url/job/{job_id}')


class RemoteSLURMCluster(SLURMCluster):
    job_cls = RemoteSLURMJob
As far as I can tell this should be a drop-in replacement.
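One detail worth sketching is how the submit response turns into a job id and how that id feeds the cancel call. The snippet below is a rough illustration, not part of dask-jobqueue: the helper names, the `errors` field handling, and the `v0.0.36` URL layout are assumptions based on the REST API reference linked above.

```python
import urllib.request


def job_id_from_submit_response(payload):
    """Return the job id as a string, or raise if the API reported errors."""
    errors = payload.get("errors") or []
    if errors:
        raise RuntimeError(f"Slurm rejected the job: {errors}")
    if "job_id" not in payload:
        raise ValueError("submit response has no 'job_id' field")
    # Returned as a string so it can be dropped into a URL or a log line.
    return str(payload["job_id"])


def build_cancel_request(base_url, job_id, api_prefix="/slurm/v0.0.36"):
    """Build (but do not send) the DELETE request that cancels a job."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}{api_prefix}/job/{job_id}",
        method="DELETE",
    )
```

Raising on a non-empty `errors` list keeps a rejected submission from being mistaken for a running worker.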
Thoughts? (tagging @mrocklin @lesteve for visibility, hope you would have time for a review.)
Issue Analytics
- Created 2 years ago
- Comments: 12 (6 by maintainers)
Top GitHub Comments
Yes - or at least somewhere where it can send/receive calls from the SLURM REST API. I would say “inside” the cluster.
This could be of interest as well: https://gist.github.com/willirath/2176a9fa792577b269cb393995f43dda
It SSHes back to the host system, where srun etc. are available.
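The general idea of the SSH route can be sketched as wrapping each Slurm command so it runs on a host where the Slurm CLI is installed. This is only an illustration of the technique, not the code from the gist; the helper name `wrap_in_ssh` and the host name `login-node` are hypothetical.

```python
import shlex


def wrap_in_ssh(host, command):
    """Return an argv list that runs `command` on `host` through ssh."""
    # shlex.join quotes each argument so the remote shell sees it unchanged.
    return ["ssh", host, shlex.join(command)]


submit_argv = wrap_in_ssh("login-node", ["sbatch", "job script.sh"])
cancel_argv = wrap_in_ssh("login-node", ["scancel", "1234"])
```

The resulting lists can be passed to `subprocess.run`, assuming key-based SSH access to the host is already set up.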