Add RemoteSlurmJob to connect SLURMCluster to a remote Slurm cluster

Hello 👋 Thank you for considering this feature request 😃 I have been looking over dask-jobqueue (together with prefect) to allocate resources on a Slurm cluster I have access to. dask-jobqueue seems exactly what we’d need for this, thank you for maintaining it 🙇

Context

When Slurm and the Python process (script or notebook) are not running on the same host, SLURMCluster cannot spawn any jobs and raises an error.

Slurm added a REST API: https://slurm.schedmd.com/rest_api.html

Feature

Could we add a RemoteSLURMCluster and a RemoteSLURMJob that largely extend SLURMCluster/SLURMJob and make an HTTP request instead of using subprocess.Popen?

from contextlib import contextmanager

import requests
from dask_jobqueue import SLURMCluster
from dask_jobqueue.slurm import SLURMJob


class RemoteSLURMJob(SLURMJob):
    @contextmanager
    def job_file(self):
        # No script file is needed, only the script text. Yielding the text
        # here leaves `async def start(self):` unchanged.
        yield self.job_script()

    async def _submit_job(self, script):
        # The request body should be formatted according to:
        # https://slurm.schedmd.com/rest_api.html#slurmctldSubmitJob
        # 'slurm-url' is a placeholder for the REST API base URL. Only the
        # script is sent here; other fields the API expects are omitted.
        response = requests.post('slurm-url/job/submit', json={'script': script})
        # Return the parsed JSON so _job_id_from_submit_output can index it.
        return response.json()

    def _job_id_from_submit_output(self, out):
        # out is the parsed JSON returned by _submit_job
        # See https://slurm.schedmd.com/rest_api.html#v0.0.36_job_submission_response
        return out['job_id']

    @classmethod
    def _close_job(cls, job_id):
        # See: https://slurm.schedmd.com/rest_api.html#slurmctldCancelJob
        requests.delete(f'slurm-url/job/{job_id}')


class RemoteSLURMCluster(SLURMCluster):
    job_cls = RemoteSLURMJob

As far as I can tell this should be a drop-in replacement.

Thoughts? (tagging @mrocklin @lesteve for visibility, hope you would have time for a review.)
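
For illustration, here is a minimal sketch of how the submit request proposed above might be assembled, using only the standard library and without sending anything. The base URL, user name, token, and job properties are hypothetical placeholders. The body layout (a `script` string plus a `job` object) and the `X-SLURM-USER-NAME` / `X-SLURM-USER-TOKEN` headers are assumptions based on the Slurm REST API documentation linked above and should be checked against the API version your cluster runs.

```python
import json
import urllib.request


def build_submit_request(base_url, api_version, user, token, script, job_properties):
    """Build (but do not send) a POST request for a Slurm REST job submission.

    Assumption: the endpoint is <base_url>/slurm/<api_version>/job/submit and
    the body carries the batch script plus a "job" object with its properties.
    """
    url = f"{base_url.rstrip('/')}/slurm/{api_version}/job/submit"
    body = json.dumps({"script": script, "job": job_properties}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumed token-based authentication headers.
            "X-SLURM-USER-NAME": user,
            "X-SLURM-USER-TOKEN": token,
        },
    )


# Hypothetical values, for illustration only.
request = build_submit_request(
    base_url="http://slurm-rest.example:6820",
    api_version="v0.0.36",
    user="alice",
    token="not-a-real-token",
    script="#!/bin/bash\necho hello\n",
    job_properties={"name": "dask-worker", "partition": "debug"},
)
print(request.get_method(), request.full_url)
```

Passing the result to `urllib.request.urlopen(request)` would actually submit the job; in a RemoteSLURMJob the job ID would then be read from the parsed JSON response.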

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 12 (6 by maintainers)

Top GitHub Comments

2 reactions
AlexanderVanEck commented, Jul 6, 2021

inside a docker container (on a login node I assume)

Yes - or at least somewhere where it can send/receive calls from the SLURM REST API. I would say “inside” the cluster.

1 reaction
willirath commented, May 6, 2022

This could be of interest as well: https://gist.github.com/willirath/2176a9fa792577b269cb393995f43dda

It’s ssh’ing back to the host system where srun etc are available.
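
The general idea of running Slurm commands over SSH can be sketched as follows. This is not the code from the gist, only an illustration of the technique: the helper name `wrap_for_ssh`, the host name, and the script path are hypothetical, and it assumes key-based SSH access to a host where `sbatch` is installed.

```python
import shlex


def wrap_for_ssh(command, host, user=None):
    """Return an argv list that runs `command` on `host` through ssh.

    The remote command is quoted into a single string so that arguments
    containing spaces survive the remote shell.
    """
    target = f"{user}@{host}" if user else host
    return ["ssh", target, shlex.join(command)]


# Hypothetical host and script path, for illustration only.
wrapped = wrap_for_ssh(["sbatch", "/tmp/job script.sh"], host="login-node")
print(wrapped)
```

The resulting list could be handed to `subprocess.run(wrapped, capture_output=True, text=True)`, so the submission output still arrives as text for the job ID to be parsed from.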
