SLURM cluster only schedules one task on 20 workers (19 idle)
The general idea is to use dask to schedule an embarrassingly parallel problem where each task requires 8 cores (it is threaded via OpenMP). That means that each worker should only take one task at a time.
This started in #181; I’m now starting my cluster like this:
cluster = SLURMCluster(walltime='01:00:00', memory='7 GB',
                       job_extra=['--nodes=1', '--ntasks-per-node=1', '--cpus-per-task=8'],
                       cores=8, extra=['--resources processes=1'])
client = Client(cluster)
which results in the following job script:
#!/bin/bash
#!/usr/bin/env bash
#SBATCH -J dask-worker
#SBATCH -n 1
#SBATCH --cpus-per-task=8
#SBATCH --mem=7G
#SBATCH -t 01:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
JOB_ID=${SLURM_JOB_ID%;*}
/home/wek224/.conda/envs/tardis3/bin/python -m distributed.cli.dask_worker tcp://172.16.2.152:45751 --nthreads 8 --memory-limit 7.00GB --name dask-worker--${JOB_ID}-- --death-timeout 60 --resources processes=1
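As a sanity check, one can confirm that every worker actually registered the processes resource; a minimal sketch, assuming distributed's Worker object exposes a total_resources dict:

from dask.distributed import Client

client = Client(cluster)  # the cluster from the snippet above

# client.run executes a function on every worker; a parameter named
# dask_worker receives the Worker instance itself.
print(client.run(lambda dask_worker: dask_worker.total_resources))
# expected output: {'tcp://<worker-address>': {'processes': 1.0}, ...}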
I define the following task:
def test_task(param_id):
    import time
    # cur_uuid = str(uuid.uuid4())
    cur_uuid = param_id
    print("\n\n################### STARTING NEW TASK ##############", cur_uuid, '#########')
    for i in range(12):
        print(cur_uuid, i, 30)
        time.sleep(5)
    return param_id
I submit jobs using this command:
futures = [client.submit(test_task, param_id, resources={'processes': 1}) for param_id in range(10000)]
But it seems that only one worker is actually doing anything while the other workers are completely idle (judging from tail -f on the SLURM .out files).
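For reference, a common alternative that avoids per-task resources entirely is to give each dask worker a single thread, so it can only run one task at a time, while still booking 8 CPUs per SLURM job for OpenMP. A minimal sketch, assuming a dask-jobqueue version that supports the job_cpu and env_extra keywords:

from dask_jobqueue import SLURMCluster
from dask.distributed import Client

cluster = SLURMCluster(walltime='01:00:00', memory='7 GB',
                       cores=1,       # one dask thread, so one concurrent task per worker
                       processes=1,   # one worker process per job
                       job_cpu=8,     # book 8 CPUs per SLURM job for OpenMP
                       env_extra=['export OMP_NUM_THREADS=8'])
cluster.scale(20)
client = Client(cluster)

# no resources= needed at submit time in this layout
futures = client.map(test_task, range(10000))

With this layout the scheduler cannot oversubscribe a worker, because each worker only has one thread on which to run tasks.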
Top GitHub Comments
@wkerzendorf you have to help us help you 😉! Without a stand-alone snippet we are reduced to wild guesses, which is not the best way of spending our time …
I strongly suggest:
OK, so let’s start with something simple: single-core jobs with a simple Python function. Can you run this in your notebook and post the output you get?
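The snippet referred to in that comment is not reproduced on this page; a hypothetical single-core check along those lines, using a made-up where_am_i helper, might look like this:

import socket
from dask.distributed import Client

client = Client(cluster)

def where_am_i(i):
    # trivial single-core task, used only to see where work lands
    return socket.gethostname()

futures = client.map(where_am_i, range(100))
print(set(client.gather(futures)))  # with 20 healthy workers, several hostnames should appear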
Closing this issue as stale; there are a lot of different problems in it. The last one raised is adaptive scaling using resources, but if someone encounters it again, we should open a new issue.