
Adaptive scaling and dask-jobqueue go into an endless loop when a job launches several worker processes (was: Different configs result in worker death)

See original GitHub issue

What happened: (Reposting from SO)

I’m using Dask-Jobqueue on a Slurm supercomputer (I’ll note that this is also a Cray machine). My workload includes a mix of threaded (i.e. numpy) and Python workloads, so I think a balance of threads and processes would be best for my deployment (which is the default behaviour). However, in order for my jobs to run I need to use this basic configuration:

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(cores=20,
                       processes=1,
                       memory="60GB",
                       walltime='12:00:00',
                       ...
                       )
cluster.adapt(minimum=0, maximum=20)
client = Client(cluster)

which is entirely threaded. The tasks also seem to take longer than I would naively expect (a large part of this is a lot of file reading/writing). Switching to purely processes, i.e.

cluster = SLURMCluster(cores=20,
                       processes=20,
                       memory="60GB",
                       walltime='12:00:00',
                       ...
                       )

results in Slurm jobs that are killed by Slurm immediately after launch, with the only output being something like:

slurmstepd: error: *** JOB 11116133 ON nid00201 CANCELLED AT 2021-04-29T17:23:25 ***

Choosing a balanced configuration (i.e. the default)

cluster = SLURMCluster(cores=20,
                       memory="60GB",
                       walltime='12:00:00',
                       ...
                       )

results in strange intermediate behaviour: the work runs nearly to completion (e.g. 900/1000 tasks), then a number of the workers are killed and progress drops back down to, say, 400/1000 tasks.

Further, I’ve found that using cluster.scale rather than cluster.adapt results in a successful run of the work (see the sketch below). Perhaps the issue here is in how adapt tries to scale the number of jobs?
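
For reference, a minimal sketch of the scale-based call that succeeds, reusing the cluster object from above (the job count of 20 is illustrative, not prescriptive):

# Static scaling: request a fixed number of Slurm jobs up front instead of
# letting adaptive add and remove them over time.
cluster.scale(jobs=20)   # or cluster.scale(n) to request n workers directly
client = Client(cluster)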

What you expected to happen: I would expect that changing the balance of processes / threads shouldn’t change the lifetime of a worker.

Anything else we need to know?: Possibly related to #20 and #363

As an aside, the current way of configuring processes / threads is confusing, and seems to conflict with how e.g. a LocalCluster is specified (a sketch of my understanding is below). Is there any progress on #231?
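
To illustrate the confusion, here is my rough understanding of how the two APIs map onto each other (a sketch, not authoritative; the parameter values are arbitrary):

from dask.distributed import LocalCluster
from dask_jobqueue import SLURMCluster

# In dask-jobqueue, `cores` is the total thread count of one Slurm job and
# `processes` is how many worker processes that job is split into, so each
# worker ends up with roughly cores // processes threads.
jobqueue_cluster = SLURMCluster(cores=20, processes=4, memory="60GB")

# LocalCluster specifies the worker shape directly instead:
local_cluster = LocalCluster(n_workers=4, threads_per_worker=5)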

Environment:

  • Dask version: 2021.4.1
  • Python version: 3.8.8
  • Operating System: SUSE Linux Enterprise Server 12 SP3
  • Install method (conda, pip, source): conda

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 12 (8 by maintainers)

Top GitHub Comments

1 reaction
guillaumeeb commented, Sep 7, 2022

Ongoing investigation here: it seems the problem is at initialization in adaptive mode, e.g. when starting the first worker process. The problem occurs when we launch adaptive without any minimum number of workers.

Using:

cluster.adapt(minimum_jobs=1, maximum_jobs=6)

is also a workaround, but you’ll always have at least one running job (which is not that bad).

1 reaction
jasonkena commented, Sep 1, 2022

@guillaumeeb I believe I found a solution to the problem (code). When adapt kills a worker, it calls scancel on the worker’s job, inevitably killing the other worker processes under the same job. To circumvent this, a worker_key must be passed to Adaptive so that, in order to kill a particular worker, adapt retires all of the workers under the same job (JobQueueCluster should probably implement this by default). I also found that specifying a higher value for interval helps prevent Dask from spawning/killing jobs every second. A sketch of what that call could look like is below.
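
For concreteness, a hedged sketch (worker_key and interval are forwarded by cluster.adapt to distributed’s Adaptive; the "<job prefix>-<process index>" naming scheme assumed in the lambda is a guess and may not match your deployment):

# `cluster` is the SLURMCluster from the original report.
cluster.adapt(
    minimum_jobs=1,
    maximum_jobs=6,
    # Group worker processes by their parent job so that retiring one worker
    # retires the whole job rather than orphaning its siblings (assumed naming
    # scheme: "<job prefix>-<process index>").
    worker_key=lambda ws: str(ws.name).rsplit("-", 1)[0],
    interval="10s",  # poll less often so jobs aren't spawned/killed every second
)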

Hope that helps.
