Handling workers with expiring allocation requests
I am trying to figure out how to handle the case of dask workers getting bumped from a cluster because their requested allocation time expires. From the intro YouTube video at https://www.youtube.com/watch?v=FXsgmwpRExM, it sounds like dask-jobqueue should detect when a worker expires and automatically start a replacement, which is what I want. However, my testing on DOE's Edison computer at NERSC does not get that behavior. If it matters, Edison uses SLURM.
I have tried setting up my cluster two ways, and both behave the same. I run a script that uses dask.delayed to submit a bunch of embarrassingly parallel tasks; the cluster spawns one worker, that worker does the first task or two, the worker's allocation expires, the scheduler seems to hang, and nothing else happens.
The first approach I used to set up the cluster was with “scale”:
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(cores=1, processes=1)  # need all the memory for one task
cluster.scale(1)  # testing with as simple a setup as I can get, cycling 1 worker
client = Client(cluster, timeout='45s')
@josephhardinee suggested a 2nd approach using “adapt” instead:
cluster = SLURMCluster(cores=1, processes=1) # need all the memory for one task
cluster.adapt(minimum=1, maximum=1) # trying adapt instead of scale
client = Client(cluster, timeout='45s')
The dask-worker.err log concludes with:
slurmstepd: error: *** JOB 10234215 ON nid01242 CANCELLED AT 2018-08-10T13:25:30 DUE TO TIME LIMIT ***
distributed.dask_worker - INFO - Exiting on signal 15
distributed.nanny - INFO - Closing Nanny at 'tcp://10.128.4.227:35634'
distributed.dask_worker - INFO - Exiting on signal 15
distributed.dask_worker - INFO - End worker
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-1, started daemon)>
Am I expecting more from dask-jobqueue than I should? Or is this a bug in my implementation, or in dask.distributed or dask-jobqueue?
Thanks, Bill
Top GitHub Comments
Thanks to @willsALMANJ's issue a few days ago, I tried the --lifetime option, and I confirm that it works perfectly with the latest Dask, Distributed and Jobqueue versions. The initial script I used (just with reduced time):
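The script itself is not reproduced here, so what follows is only a rough sketch of that kind of test, with illustrative memory, walltime and task values (the work function is hypothetical):

# Illustrative reconstruction only: the short walltime guarantees SLURM kills
# the workers before the workload finishes.
import time

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

def work(i):
    time.sleep(60)  # hypothetical task, long enough that the batch spans several allocations
    return i * 2

cluster = SLURMCluster(cores=1, processes=1, memory="4GB",
                       walltime="00:05:00")  # deliberately short allocation
cluster.scale(4)  # four workers, all hitting the walltime at about the same time
client = Client(cluster)

futures = client.map(work, range(100))  # more work than one allocation can finish
results = client.gather(futures)  # raises KilledWorker once the workers are killed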
It fails with a KilledWorker exception when the first 4 workers are killed due to walltime.
Just modify the cluster initialization:
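The exact modification is not quoted above; a sketch of passing --lifetime at cluster creation is shown below. The keyword that forwards extra dask-worker arguments is worker_extra_args in recent dask-jobqueue releases (older releases called it extra), and the concrete values are only illustrative:

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Sketch: each worker retires itself shortly before its allocation ends, so its
# tasks and data are migrated to other workers instead of being lost when SLURM
# cancels the job.
cluster = SLURMCluster(
    cores=1, processes=1, memory="4GB",
    walltime="00:05:00",
    worker_extra_args=["--lifetime", "4m", "--lifetime-stagger", "30s"],
)
cluster.adapt(minimum=1, maximum=4)  # adaptive starts replacements as workers retire
client = Client(cluster)

The lifetime should be comfortably shorter than the walltime, so a worker has time to hand off its state before the queueing system cancels its job.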
And it works! I think it solves the problem here.
I think this would be valuable when scaling up to 100s of workers; at that point you don't want them all to stop at the same time.
I’ll try to produce some documentation to explain all that and close this issue. The outline should look something like:
How to handle the job queueing system walltime killing workers
- Reaching walltime can be troublesome.
- If you don't set the proper parameters, you'll run into KilledWorker exceptions in those two cases.
- Use the --lifetime worker option. This enables infinite workloads using adaptive.
- Use --lifetime-stagger when dealing with many workers.
- Examples
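A sketch of what such an example might contain (values are illustrative, and the worker_extra_args keyword is assumed as above), combining --lifetime, --lifetime-stagger and adaptive scaling for a long-running workload:

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Sketch: workers live for about 55 minutes of a 1-hour allocation, staggered by
# up to 5 minutes so a large pool does not retire and restart all at once.
cluster = SLURMCluster(
    cores=4, processes=4, memory="16GB",
    walltime="01:00:00",
    worker_extra_args=["--lifetime", "55m", "--lifetime-stagger", "5m"],
)
cluster.adapt(minimum=10, maximum=200)  # workload can outlive any single job
client = Client(cluster)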
As mentioned in #126, I fear that adaptive mode is broken in release 0.3.0 of dask-jobqueue. It has since been fixed by #63.
I would recommend trying the master branch and seeing if that fixes this wrong behaviour.