
Work-arounds for CPU time limits where the Dask scheduler lives?


One of the clusters I have access to has set CPU time limits on its login nodes.

This makes it a lot less convenient to use dask-jobqueue:

  • IMO the easiest thing to do is to run the Dask scheduler on a login node. This way you don’t have to write a submission script and can stay entirely in Python
  • even though the Dask scheduler only consumes 5-10% CPU, it quickly goes over the CPU time limit (currently a 30-minute CPU time limit; there is a chance it can be increased, but not by much)
  • once the Dask scheduler gets killed by the CPU time limit, the Dask workers, unable to contact it for death-timeout seconds, get killed too, so some of your tasks get killed in the middle of their execution and some of the tasks never run. Recovering from this may not be trivial. (See the sketch right after this list for where death-timeout is configured.)
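For reference, a minimal sketch of where the death-timeout knob lives when creating a dask-jobqueue cluster (the SLURM backend and the resource values are hypothetical). Raising it only buys the workers more slack; it does not solve the underlying scheduler lifetime problem:

import dask_jobqueue

# Hypothetical resources; only death_timeout matters for this point.
cluster = dask_jobqueue.SLURMCluster(
    cores=4,
    memory="16GB",
    walltime="24:00:00",
    death_timeout=600,  # seconds a worker waits for the scheduler before shutting itself down
)
cluster.scale(jobs=2)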

@dask/dask-jobqueue if anyone has suggestions/work-arounds I would be very interested!

A bit more context:

  • those jobs use deep learning and typically last 1-5 days. The fact that you need the Dask scheduler to live for that long (a Dask worker dies after death-timeout seconds if it cannot contact the Dask scheduler), and maybe even longer if your Dask scheduler coordinates many jobs that don’t necessarily start at the same time, is going to be an issue
  • the cluster only has GPU nodes (a bit of an over-simplification, but accurate enough), so there is no easy way to have a CPU-only job where the Dask scheduler lives. Even if it were possible, it is very unlikely that the maximum time for the CPU queue would be long enough

Possible variation of this issue:

  • what to do if the total work my Dask scheduler manages lasts longer than my allowed interactive job (e.g. inside my Jupyter notebook where I create the Cluster object)

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 11 (10 by maintainers)

Top GitHub Comments

2 reactions
stuarteberg commented, Oct 27, 2020

the cluster only has GPU nodes (a bit of an over-simplification, but accurate enough), so there is no easy way to have a CPU-only job where the Dask scheduler lives.

One option is to host the scheduler on one of the GPU nodes. But of course, it would be a waste of a GPU if that node can’t participate as a worker, too.

If your non-login nodes are allowed to launch jobs themselves, then I propose the following:

  1. Run the scheduler on a GPU node
  2. From there, launch a cluster of N-1 workers
  3. Create one more worker locally, and add it to the cluster, so the local node is not wasted.

Here’s some example code that works on my LSF cluster. I’m not really an expert on the dask cluster API, so let me know if there’s a more straightforward way to create a heterogeneous cluster. In this code, I’m manually manipulating the worker_spec dictionary, which is used by the scale() function to create new workers.

In [1]: import time
   ...: from distributed import LocalCluster, Client
   ...: from dask_jobqueue import LSFCluster
   ...:
   ...: cluster = LSFCluster(ip='0.0.0.0', cores=1, memory='15GB', ncpus=1, mem=int(15e9), log_directory='dask-logs')
   ...: cluster.scale(2)
   ...:
   ...: client = Client(cluster)
   ...: while client.status == "running" and len(cluster.scheduler.workers) < 2:
   ...:     print("Waiting for remote workers to start...")
   ...:     time.sleep(1.0)
   ...:
   ...: # Temporarily start up a local cluster just to copy the worker spec.
   ...: with LocalCluster(1) as lc:
   ...:     local_worker_spec = lc.worker_spec[0]
   ...:
   ...: cluster.worker_spec[2] = local_worker_spec
   ...:
   ...: # Force cluster._i to increment.
   ...: # Notice the side effect in SpecCluster.new_worker_spec() implementation.
   ...: cluster.new_worker_spec()
   ...:
   ...: # Kick the cluster event loop to make it use the new worker spec.
   ...: cluster.scale(3)
Waiting for remote workers to start...
Waiting for remote workers to start...

To test that it worked, I can call client.run to see which nodes my workers are running on. As expected, two of my workers are running on a remote node, and one of them is running on the local node (h10u03), where the client and scheduler are also running.

In [2]: import socket
   ...: socket.gethostname()
Out[2]: 'h10u03.int.janelia.org'

In [3]: client.run(socket.gethostname)
Out[3]:
{'tcp://10.36.110.13:31650': 'h10u03.int.janelia.org',
 'tcp://10.36.111.17:25341': 'h11u07.int.janelia.org',
 'tcp://10.36.111.17:31518': 'h11u07.int.janelia.org'}

I know that both LSF and SGE are capable of launching jobs from ordinary (non-login) nodes, but some cluster administrators choose not to permit it, for reasons I do not understand. If that’s the case for your cluster, it may be worth asking the administrators why they have chosen not to support such a useful feature.

1 reaction
willirath commented, Oct 28, 2020

If the time limitation hitting the scheduler cannot be fixed, and easily resuming Dask computations therefore remains the goal, there could be solutions based on serialized Dask graphs. You can cloudpickle Dask graphs (even as part of higher-level objects like xarray.Datasets), and there is the (orphaned?) https://github.com/radix-ai/graphchain project, which might be helpful here.
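A minimal sketch of that idea (the file name graph.pkl and the toy computation are made up for illustration): serialize the still-lazy graph before the scheduler is killed, then load and compute it later against a fresh cluster.

import cloudpickle
import dask

@dask.delayed
def heavy_step(x):
    # stand-in for a long-running task
    return x ** 2

# Build the graph lazily; nothing has executed yet.
result = dask.delayed(sum)([heavy_step(i) for i in range(10)])

# Persist the unevaluated graph so it survives the scheduler's death.
with open("graph.pkl", "wb") as f:
    cloudpickle.dump(result, f)

# Later, in a new session connected to a new scheduler:
with open("graph.pkl", "rb") as f:
    restored = cloudpickle.load(f)

print(restored.compute())  # 285

Note that this only saves the description of the work, not partial results; anything already computed would still need to be persisted separately (e.g. written to disk) to avoid recomputation.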
