Adaptive scaling and dask-jobqueue go into an endless loop when a job launches several worker processes (was: Different configs result in worker death)
What happened: (Reposting from SO)
I’m using Dask-Jobqueue on a Slurm supercomputer (I’ll note that this is also a Cray machine). My workload includes a mix of threaded (i.e. numpy) and pure-Python workloads, so I think a balance of threads and processes would be best for my deployment (which is the default behaviour). However, in order for my jobs to run I need to use this basic configuration:
```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(cores=20,
                       processes=1,
                       memory="60GB",
                       walltime='12:00:00',
                       ...
                       )
cluster.adapt(minimum=0, maximum=20)
client = Client(cluster)
```
which is entirely threaded. The tasks also seem to take longer than I would naively expect (a large part of this is a lot of file reading/writing). Switching to purely processes, i.e.
```python
cluster = SLURMCluster(cores=20,
                       processes=20,
                       memory="60GB",
                       walltime='12:00:00',
                       ...
                       )
```
results in Slurm jobs that are killed immediately after they are launched, with the only output being:
```
slurmstepd: error: *** JOB 11116133 ON nid00201 CANCELLED AT 2021-04-29T17:23:25 ***
```
Choosing a balanced configuration (i.e. the default)
```python
cluster = SLURMCluster(cores=20,
                       memory="60GB",
                       walltime='12:00:00',
                       ...
                       )
```
results in a strange intermediate behaviour. The work runs nearly to completion (e.g. 900/1000 tasks), then a number of the workers are killed and the progress drops back down to, say, 400/1000 tasks.
Further, I’ve found that using `cluster.scale`, rather than `cluster.adapt`, results in a successful run of the work. Perhaps the issue here is how adapt is trying to scale the number of jobs?
What you expected to happen: I would expect that changing the balance of processes / threads shouldn’t change the lifetime of a worker.
Anything else we need to know?: Possibly related to #20 and #363
As an aside, the current configuration of processes / threads is confusing, and seems to conflict with how e.g. a `LocalCluster` is specified. Is there any progress on #231?
Environment:
- Dask version: 2021.4.1
- Python version: 3.8.8
- Operating System: SUSE Linux Enterprise Server 12 SP3
- Install method (conda, pip, source): conda
Top GitHub Comments
Ongoing investigation here: it seems the problem is at initialization in adaptive mode, e.g. when starting the first worker process. The problem occurs when we launch adaptive without any minimum number of workers.
Using a nonzero adaptive minimum is also a workaround, but you’ll always have at least one running job (which is not that bad).
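A minimal sketch of that workaround, assuming the same cluster object as in the original report (the maximum shown is illustrative):
```python
# Keep at least one worker alive so adaptive scaling never starts from zero jobs.
cluster.adapt(minimum=1, maximum=20)
```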
@guillaumeeb I believe I found a solution to the problem (code). When adapt kills a worker, it calls `scancel` on the worker’s job, inevitably killing the other worker processes under the same job. To circumvent this, `worker_key` must be passed to `Adaptive` to force adapt to retire all workers under a job when it wants to kill a particular worker (`JobQueueCluster` should probably implement this by default). I also found that specifying a higher value for `interval` helps prevent Dask from spawning/killing jobs every second. Hope that helps.
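A minimal sketch of what such a configuration might look like, assuming worker names take the form `<job-prefix>-<process-index>`; the grouping lambda and interval value are illustrative, not the exact code linked above:
```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(cores=20,
                       memory="60GB",
                       walltime='12:00:00')

# Group workers by the job they belong to (assuming names look like
# "<job-prefix>-<process-index>") so that adapt retires a whole job's workers
# together instead of scancel-ing a job that still hosts live workers.
# A longer interval keeps adapt from re-evaluating (and churning jobs) every second.
cluster.adapt(
    minimum=0,
    maximum=20,
    worker_key=lambda ws: ws.name.rsplit("-", 1)[0],
    interval="30s",
)
client = Client(cluster)
```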