Work-arounds for CPU time limits where the Dask scheduler lives?
One of the clusters I have access to enforces a CPU time limit on its login nodes. This makes it a lot less convenient to use dask-jobqueue:
- IMO the easiest thing to do is to run the Dask scheduler on a login node. This way you don't have to write submission scripts and can stay entirely in Python.
- Even at 5-10% CPU usage, the Dask scheduler quickly exceeds the CPU time limit (currently 30 minutes of CPU time; there is a chance it can be increased, but not by much).
- Once the Dask scheduler is killed by the CPU time limit, the Dask workers, unable to contact the scheduler for `death-timeout` seconds, get killed too. Some of your tasks are killed in the middle of their execution and some never run at all. Recovering from this may not be trivial. (A sketch of raising `death_timeout` follows this list.)
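One partial mitigation, shown here only as a minimal sketch, is to raise `death_timeout` so workers tolerate a longer scheduler outage before shutting themselves down. This assumes dask-jobqueue with SLURM; the queue, memory, and walltime values below are placeholders.

```python
from dask_jobqueue import SLURMCluster

# Minimal sketch: workers wait `death_timeout` seconds for the scheduler
# before shutting down. All resource values here are placeholders.
cluster = SLURMCluster(
    cores=8,
    memory="32GB",
    queue="gpu",             # hypothetical partition name
    walltime="24:00:00",
    death_timeout=600,       # give workers 10 minutes to ride out scheduler hiccups
)
cluster.scale(jobs=4)        # submit 4 worker jobs
```

This only buys time, though: it does nothing about the scheduler itself being killed by the CPU time limit.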
@dask/dask-jobqueue, if anyone has suggestions or work-arounds, I would be very interested!
A bit more context:
- These jobs use deep learning and typically last 1-5 days. Needing the Dask scheduler to stay alive for that long (a Dask worker dies after `death-timeout` if it cannot contact the scheduler), and possibly even longer if the scheduler coordinates many jobs that don't all start at the same time, is going to be an issue.
- The cluster only has GPU nodes (a bit of an over-simplification, but accurate enough), so there is no easy way to have a CPU-only job where the Dask scheduler lives. Even if it were possible, it is very unlikely that the maximum walltime of the CPU queue would be long enough.
Possible variation of this issue:
- What to do if the total work my Dask scheduler manages lasts longer than my allowed interactive job (e.g. inside my Jupyter notebook where I create the `Cluster` object)? (See the sketch after this list.)
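One pattern that decouples the scheduler's lifetime from any interactive session (and from the login node) is to run the scheduler inside its own batch job and connect workers and clients through a scheduler file on shared storage. This is a generic Dask deployment pattern rather than something specific to dask-jobqueue; the paths below are placeholders.

```python
# 1) A long-running batch job starts the scheduler and writes its address
#    to a file on a shared filesystem, e.g. the job script runs:
#        dask scheduler --scheduler-file /shared/scheduler.json
#    (on older installs the command is `dask-scheduler`)
#
# 2) Worker jobs point at the same file:
#        dask worker --scheduler-file /shared/scheduler.json
#
# 3) Any short-lived session (login node, Jupyter, a fresh interactive job)
#    can then attach to the running cluster:
from dask.distributed import Client

client = Client(scheduler_file="/shared/scheduler.json")  # path is a placeholder
print(client.scheduler_info()["workers"])
```

The interactive session can then come and go while the scheduler job keeps running, subject of course to the batch queue's own walltime limits.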
Top GitHub Comments
One option is to host the scheduler on one of the GPU nodes. But of course, it would be a waste of a GPU if that node can’t participate as a worker, too.
If your non-login nodes are allowed to launch jobs themselves, then I propose the following:
Here's some example code that works on my LSF cluster. I'm not really an expert on the Dask cluster API, so let me know if there's a more straightforward way to create a heterogeneous cluster. In this code, I'm manually manipulating the `worker_spec` dictionary, which is used by the `scale()` function to create new workers.
To test that it worked, I can call `client.run` to see which nodes my workers are running on. As expected, two of my workers are running on a remote node, and one of them is running on the local node (`h10u03`), where the client and scheduler are also running.
I know that both LSF and SGE are capable of launching jobs from ordinary (non-login) nodes, but some cluster administrators choose not to permit it for reasons I do not understand. If that's the case for your cluster, it may be worth asking the administrators why they have chosen not to support such a useful feature.
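A rough sketch in the same spirit (not the commenter's original code), assuming an LSF queue named `normal` and placeholder resources: `client.run(socket.gethostname)` reports which node each worker landed on, and launching a plain `dask worker` process is shown as one way to add a worker on the local node without editing `worker_spec` directly.

```python
import socket
import subprocess

from dask.distributed import Client
from dask_jobqueue import LSFCluster

# Placeholder queue and resources; adjust for your site.
cluster = LSFCluster(cores=4, memory="16GB", queue="normal")
cluster.scale(2)                    # two workers submitted as LSF jobs

client = Client(cluster)

# worker_spec is the dict that scale() fills in; each entry describes how
# one worker job will be launched.
print(cluster.worker_spec)

# One way to also run a worker on the local node (where the client and
# scheduler live), without touching worker_spec: start a plain worker
# process pointed at the same scheduler ("dask-worker" on older installs).
local_worker = subprocess.Popen(["dask", "worker", cluster.scheduler_address])

client.wait_for_workers(3)

# Check which node each worker actually ended up on.
print(client.run(socket.gethostname))
```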
If the time limit hitting the scheduler cannot be avoided, and easy resumption of Dask computations therefore remains the goal, there could be solutions based on serialized Dask graphs. You can `cloudpickle` Dask graphs (even as part of high-level objects like `xarray.Dataset`s), and there is the (orphaned?) https://github.com/radix-ai/graphchain project, which might be helpful here.
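As a minimal sketch of the serialized-graph idea, assuming a lazy collection built with `dask.delayed` (the file path is a placeholder): the unevaluated graph is pickled to disk, so a later session with a fresh scheduler can reload it and compute.

```python
import cloudpickle
import dask

@dask.delayed
def preprocess(x):
    return x * 2

@dask.delayed
def combine(parts):
    return sum(parts)

# Build a lazy graph without computing anything yet.
result = combine([preprocess(i) for i in range(100)])

# Persist the graph itself (the "recipe", not the results). Path is a placeholder.
with open("graph.pkl", "wb") as f:
    cloudpickle.dump(result, f)

# ... later, possibly after the original scheduler was killed ...
with open("graph.pkl", "rb") as f:
    restored = cloudpickle.load(f)

print(restored.compute())   # runs on whatever scheduler/cluster is active now
```

Note that this only preserves the recipe; completed tasks would be recomputed unless intermediate results are cached somewhere, which is roughly what graphchain aims to provide.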