
SLURM cluster only schedules one task on 20 workers (19 idle)

See original GitHub issue

The general idea is to use dask to schedule an embarrassingly parallel problem where each task requires 8 cores (it is threaded via OpenMP). That means that each worker should only run one task at a time.
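
As an aside (not part of the original issue), a minimal local sketch of the worker-resources mechanism this setup relies on: a worker that advertises resources={'processes': 1} can only hold one task that requests resources={'processes': 1} at a time, regardless of how many threads it has.

from dask.distributed import LocalCluster, Client

# Each worker advertises a single 'processes' token, so a task that asks for
# one token runs alone on its worker, even though the worker has 8 threads.
cluster = LocalCluster(n_workers=2, threads_per_worker=8,
                       resources={'processes': 1})
client = Client(cluster)

def noop(x):
    return x + 1

futures = client.map(noop, range(10), resources={'processes': 1})
print(client.gather(futures))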

This started in #181. I'm now running my cluster like this:

cluster = SLURMCluster(walltime='01:00:00', memory='7 GB', 
                       job_extra=['--nodes=1', '--ntasks-per-node=1', '--cpus-per-task=8'], cores=8, extra=['--resources processes=1'])
client = Client(cluster)

resulting in the following job script:

#!/bin/bash

#!/usr/bin/env bash
#SBATCH -J dask-worker
#SBATCH -n 1
#SBATCH --cpus-per-task=8
#SBATCH --mem=7G
#SBATCH -t 01:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
JOB_ID=${SLURM_JOB_ID%;*}



/home/wek224/.conda/envs/tardis3/bin/python -m distributed.cli.dask_worker tcp://172.16.2.152:45751 --nthreads 8 --memory-limit 7.00GB --name dask-worker--${JOB_ID}-- --death-timeout 60 --resources processes=1
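
(A small aside, not from the original post: the generated submission script can also be inspected from the client side before any jobs are submitted.)

# Print the SLURM batch script that dask-jobqueue will submit, to verify
# the SBATCH directives and the dask-worker command line.
print(cluster.job_script())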

I define the following task:

def test_task(param_id):
    # cur_uuid = str(uuid.uuid4())
    cur_uuid = param_id
    import time
    print("\n\n################### STARTING NEW TASK ##############", cur_uuid, '#########')
    for i in range(12):
        print(cur_uuid, i, 30)
        time.sleep(5)
    return param_id

I submit the tasks using this command:

futures = [client.submit(test_task, param_id, resources={'processes':1}) for param_id in range(10000)]

But it seems that only one worker is actually doing anything while the other workers are completely idle (judging by tail -f on the SLURM output files).
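
One way to confirm this from the client side, instead of tailing the output files, is to ask the scheduler what each worker is doing (a diagnostic sketch added here, not part of the original report):

from pprint import pprint

# Connected workers, with their thread counts and advertised resources.
pprint(client.scheduler_info()['workers'])

# Task keys currently running on each worker; idle workers show an empty list.
pprint(client.processing())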

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 36 (15 by maintainers)

Top GitHub Comments

1 reaction
lesteve commented, Feb 4, 2019

@wkerzendorf you have to help us help you 😉! Without a stand-alone snippet we are reduced to wild guesses, which is not the best way of spending our time…

I strongly suggest:

  • we look at one problem at a time in this issue. My understanding is that the original question was the following: only some of the workers are used to run the computation, while the rest do nothing. Correct me if I misunderstood, so we can make sure we are all on the same page.
  • we start with something simple and increase complexity step by step as I suggested in my earlier message

OK, so let's start with something simple: single-core jobs with a simple Python function. Can you run this in your notebook and post the output you get?

import time
import logging
import socket
import os
from pprint import pprint

from distributed import Client

from dask_jobqueue import SLURMCluster


def slow_increment(x):
    time.sleep(5)
    return {'result': x + 1,
            'host': socket.gethostname(),
            'pid': os.getpid(),
            'time': int(time.time() % 100)}


cluster = SLURMCluster(walltime='01:00:00', memory='7 GB',
                       job_extra=['--nodes=1', '--ntasks-per-node=1'],
                       cores=1)

cluster.scale(2)
client = Client(cluster)

nb_workers = 0
while True:
    nb_workers = len(client.scheduler_info()["workers"])
    print('Got {} workers'.format(nb_workers))
    if nb_workers >= 2:
        break
    time.sleep(1)

futures = client.map(slow_increment, range(8))

print('client:', client)

results = client.gather(futures)
print('results:')
pprint(results)
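
A quick follow-up check (my addition, assuming the script above has run and results holds the gathered list): count how many distinct worker processes actually handled tasks.

from collections import Counter
from pprint import pprint

# Tally results per (host, pid) pair; more than one entry means more than
# one worker process did work.
per_worker = Counter((r['host'], r['pid']) for r in results)
pprint(dict(per_worker))
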
0 reactions
guillaumeeb commented, Aug 30, 2022

Closing this issue as stale; there are a lot of different problems in it. The last one raised is adaptive scaling using resources, but if someone encounters it again, we should open a new issue.
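
The open question there was adaptive scaling combined with worker resources; for context, plain adaptive scaling with dask-jobqueue looks roughly like this (a sketch reusing the cluster settings from above, not a resolution of that question):

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(walltime='01:00:00', memory='7 GB', cores=8, processes=1)
# Let the cluster submit and cancel SLURM jobs based on the scheduler's load,
# between 0 and 20 workers.
cluster.adapt(minimum=0, maximum=20)
client = Client(cluster)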

