Adaptive scaling and dask-jobqueue go into an endless loop when a job launches several worker processes (was: Different configs result in worker death)
What happened: (Reposting from SO)
I’m using Dask-Jobqueue on a Slurm supercomputer (I’ll note that this is also a Cray machine). My workload includes a mix of threaded (i.e. numpy) and pure-Python workloads, so I think a balance of threads and processes would be best for my deployment (which is the default behaviour). However, in order for my jobs to run I need to use this basic configuration:
```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(cores=20,
                       processes=1,
                       memory="60GB",
                       walltime='12:00:00',
                       ...
                       )
cluster.adapt(minimum=0, maximum=20)
client = Client(cluster)
```
which is entirely threaded. The tasks also seem to take longer than I would naively expect (a large part of this is a lot of file reading/writing). Switching to purely processes, i.e.
```python
cluster = SLURMCluster(cores=20,
                       processes=20,
                       memory="60GB",
                       walltime='12:00:00',
                       ...
                       )
```
results in Slurm jobs that are killed immediately after they are launched, with the only output being:
```
slurmstepd: error: *** JOB 11116133 ON nid00201 CANCELLED AT 2021-04-29T17:23:25 ***
```
Choosing a balanced configuration (i.e. the default)
```python
cluster = SLURMCluster(cores=20,
                       memory="60GB",
                       walltime='12:00:00',
                       ...
                       )
```
results in a strange intermediate behaviour. The work runs nearly to completion (e.g. 900/1000 tasks), then a number of the workers are killed and the progress drops back down to, say, 400/1000 tasks.
Further, I’ve found that using `cluster.scale`, rather than `cluster.adapt`, results in a successful run of the work. Perhaps the issue here is how adapt is trying to scale the number of jobs?
What you expected to happen: I would expect that changing the balance of processes / threads shouldn’t change the lifetime of a worker.
Anything else we need to know?: Possibly related to #20 and #363
As an aside, the current configuration of processes / threads is confusing, and seems to conflict with how e.g. a `LocalCluster` is specified. Is there any progress on #231?
Environment:
- Dask version: 2021.4.1
- Python version: 3.8.8
- Operating System: SUSE Linux Enterprise Server 12 SP3
- Install method (conda, pip, source): conda
Top GitHub Comments
Ongoing investigation here: it seems the problem is at initialization in adaptive mode, e.g. when starting the first worker process. The problem occurs when we launch adaptive without any minimum number of workers.
Using a nonzero adaptive minimum is also a workaround, but you’ll always have at least one running job (which is not that bad).
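A minimal sketch of that workaround, assuming the same cluster object as in the original report (the maximum shown is illustrative):
```python
# Keep at least one worker alive so adaptive scaling never starts from zero jobs.
cluster.adapt(minimum=1, maximum=20)
```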
@guillaumeeb I believe I found a solution to the problem (code). When adapt kills a worker, it calls `scancel` on the worker’s job, inevitably killing the other worker processes under the same job. To circumvent this, `worker_key` must be passed to `Adaptive` to force adapt to retire all workers under a job when it wants to kill a particular worker (`JobQueueCluster` should probably implement this by default). I also found that specifying a higher value for `interval` helps prevent Dask from spawning/killing jobs every second. Hope that helps.
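A minimal sketch of what such a configuration might look like, assuming worker names take the form `<job-prefix>-<process-index>`; the grouping lambda and interval value are illustrative, not the exact code linked above:
```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(cores=20,
                       memory="60GB",
                       walltime='12:00:00')

# Group workers by the job they belong to (assuming names look like
# "<job-prefix>-<process-index>") so that adapt retires a whole job's workers
# together instead of scancel-ing a job that still hosts live workers.
# A longer interval keeps adapt from re-evaluating (and churning jobs) every second.
cluster.adapt(
    minimum=0,
    maximum=20,
    worker_key=lambda ws: ws.name.rsplit("-", 1)[0],
    interval="30s",
)
client = Client(cluster)
```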