Work-arounds for CPU time limits where the Dask scheduler lives?
One of the clusters I have access to enforces a CPU time limit on its login nodes. This makes it a lot less convenient to use dask-jobqueue:
- IMO the easiest thing to do is to run the Dask scheduler on a login node. This way you don't have to write submission scripts and can stay entirely in Python.
- Even at 5-10% CPU usage, the Dask scheduler quickly exceeds the CPU time limit (currently 30 minutes of CPU time; there is a chance it can be increased, but not by much).
- Once the Dask scheduler is killed by the CPU time limit, the Dask workers, unable to contact the scheduler for `death-timeout` seconds, get killed too. Some of your tasks are killed in the middle of their execution and some never run at all. Recovering from this may not be trivial. (A sketch of raising `death_timeout` follows this list.)
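One partial mitigation, shown here only as a minimal sketch, is to raise `death_timeout` so workers tolerate a longer scheduler outage before shutting themselves down. This assumes dask-jobqueue with SLURM; the queue, memory, and walltime values below are placeholders.

```python
from dask_jobqueue import SLURMCluster

# Minimal sketch: workers wait `death_timeout` seconds for the scheduler
# before shutting down. All resource values here are placeholders.
cluster = SLURMCluster(
    cores=8,
    memory="32GB",
    queue="gpu",             # hypothetical partition name
    walltime="24:00:00",
    death_timeout=600,       # give workers 10 minutes to ride out scheduler hiccups
)
cluster.scale(jobs=4)        # submit 4 worker jobs
```

This only buys time, though: it does nothing about the scheduler itself being killed by the CPU time limit.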
@dask/dask-jobqueue, if anyone has suggestions or work-arounds, I would be very interested!
A bit more context:
- These jobs use deep learning and typically last 1-5 days. Needing the Dask scheduler to stay alive for that long (a Dask worker dies after `death-timeout` if it cannot contact the scheduler), and possibly even longer if the scheduler coordinates many jobs that don't all start at the same time, is going to be an issue.
- The cluster only has GPU nodes (a bit of an over-simplification, but accurate enough), so there is no easy way to have a CPU-only job where the Dask scheduler lives. Even if it were possible, it is very unlikely that the maximum walltime of the CPU queue would be long enough.
Possible variation of this issue:
- What to do if the total work my Dask scheduler manages lasts longer than my allowed interactive job (e.g. inside my Jupyter notebook where I create the `Cluster` object)? (See the sketch after this list.)
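One pattern that decouples the scheduler's lifetime from any interactive session (and from the login node) is to run the scheduler inside its own batch job and connect workers and clients through a scheduler file on shared storage. This is a generic Dask deployment pattern rather than something specific to dask-jobqueue; the paths below are placeholders.

```python
# 1) A long-running batch job starts the scheduler and writes its address
#    to a file on a shared filesystem, e.g. the job script runs:
#        dask scheduler --scheduler-file /shared/scheduler.json
#    (on older installs the command is `dask-scheduler`)
#
# 2) Worker jobs point at the same file:
#        dask worker --scheduler-file /shared/scheduler.json
#
# 3) Any short-lived session (login node, Jupyter, a fresh interactive job)
#    can then attach to the running cluster:
from dask.distributed import Client

client = Client(scheduler_file="/shared/scheduler.json")  # path is a placeholder
print(client.scheduler_info()["workers"])
```

The interactive session can then come and go while the scheduler job keeps running, subject of course to the batch queue's own walltime limits.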
Top GitHub Comments
One option is to host the scheduler on one of the GPU nodes. But of course, it would be a waste of a GPU if that node can’t participate as a worker, too.
If your non-login nodes are allowed to launch jobs themselves, then I propose the following:
Here's some example code that works on my LSF cluster. I'm not really an expert on the Dask cluster API, so let me know if there's a more straightforward way to create a heterogeneous cluster. In this code, I'm manually manipulating the `worker_spec` dictionary, which is used by the `scale()` function to create new workers.
To test that it worked, I can call `client.run` to see which nodes my workers are running on. As expected, two of my workers are running on a remote node, and one of them is running on the local node (`h10u03`), where the client and scheduler are also running.
I know that both LSF and SGE are capable of launching jobs from ordinary (non-login) nodes, but some cluster administrators choose not to permit it for reasons I do not understand. If that's the case for your cluster, it may be worth asking the administrators why they have chosen not to support such a useful feature.
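A rough sketch in the same spirit (not the commenter's original code), assuming an LSF queue named `normal` and placeholder resources: `client.run(socket.gethostname)` reports which node each worker landed on, and launching a plain `dask worker` process is shown as one way to add a worker on the local node without editing `worker_spec` directly.

```python
import socket
import subprocess

from dask.distributed import Client
from dask_jobqueue import LSFCluster

# Placeholder queue and resources; adjust for your site.
cluster = LSFCluster(cores=4, memory="16GB", queue="normal")
cluster.scale(2)                    # two workers submitted as LSF jobs

client = Client(cluster)

# worker_spec is the dict that scale() fills in; each entry describes how
# one worker job will be launched.
print(cluster.worker_spec)

# One way to also run a worker on the local node (where the client and
# scheduler live), without touching worker_spec: start a plain worker
# process pointed at the same scheduler ("dask-worker" on older installs).
local_worker = subprocess.Popen(["dask", "worker", cluster.scheduler_address])

client.wait_for_workers(3)

# Check which node each worker actually ended up on.
print(client.run(socket.gethostname))
```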
If the time limit hitting the scheduler cannot be avoided, and easy resumption of Dask computations therefore remains the goal, there could be solutions based on serialized Dask graphs. You can `cloudpickle` Dask graphs (even as part of high-level objects like `xarray.Dataset`s), and there is the (orphaned?) https://github.com/radix-ai/graphchain project, which might be helpful here.
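As a minimal sketch of the serialized-graph idea, assuming a lazy collection built with `dask.delayed` (the file path is a placeholder): the unevaluated graph is pickled to disk, so a later session with a fresh scheduler can reload it and compute.

```python
import cloudpickle
import dask

@dask.delayed
def preprocess(x):
    return x * 2

@dask.delayed
def combine(parts):
    return sum(parts)

# Build a lazy graph without computing anything yet.
result = combine([preprocess(i) for i in range(100)])

# Persist the graph itself (the "recipe", not the results). Path is a placeholder.
with open("graph.pkl", "wb") as f:
    cloudpickle.dump(result, f)

# ... later, possibly after the original scheduler was killed ...
with open("graph.pkl", "rb") as f:
    restored = cloudpickle.load(f)

print(restored.compute())   # runs on whatever scheduler/cluster is active now
```

Note that this only preserves the recipe; completed tasks would be recomputed unless intermediate results are cached somewhere, which is roughly what graphchain aims to provide.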