Can dask-jobqueue use multi-node jobs (Was: Creating dask-jobqueue cluster from a single job versus multiple jobs)?
This is not strictly an “issue” and is more a question about suggested usage, so if this question belongs somewhere else, please direct me there!
I’ve been working closely with admins of the NASA Pleiades HPC system on how best to support interactive Dask workflows on that system. Pleiades uses PBS. Thus far, my workflow has been to configure a cluster in which a single worker uses all available cores and memory on a single node. For example, for a machine that has 12 cores and 48 GB of memory per node, my jobqueue config is the following:
```yaml
jobqueue:
  pbs:
    cores: 12
    processes: 1
    memory: 48GB
    interface: ib0
    resource_spec: "select=1:ncpus=12"
    walltime: "00:30:00"
```
I then request 10-20 nodes by starting an equivalent number of jobs, e.g. by running `cluster.scale(10)`. A lot of the time, this configuration works well, but one of the problems that has cropped up (and is common to many systems) is limited availability of resources, i.e. when the cluster is busy, I may have to wait > 30 minutes for even a single job to start; not so ideal for interactive workflows! The entire Pleiades system is in very high demand, so this is often an issue, particularly on the newer, faster processors.
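For concreteness, here is roughly what this multi-job workflow looks like on the Python side (a minimal sketch, assuming the jobqueue config above is picked up from the default Dask config location):

```python
# Minimal sketch of the multi-job workflow (assumes the PBS settings above
# are read from the default Dask config, e.g. ~/.config/dask/jobqueue.yaml).
from dask.distributed import Client
from dask_jobqueue import PBSCluster

cluster = PBSCluster()    # one job = one node = one 12-core / 48 GB worker
cluster.scale(10)         # submit 10 independent PBS jobs
client = Client(cluster)  # workers join the scheduler as their jobs start
```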
After raising this issue with the HPC staff, one suggested solution was to use a high-availability queue (called “devel”) that allows users to submit only a single job at a time, but has very high availability (i.e. short waiting times). In this case, the suggested pattern would be to submit a single job that requests multiple cores on multiple nodes. For 25 nodes, each with 24 cores and 128 GB of memory, the jobqueue config is:
```yaml
jobqueue:
  pbs:
    cores: 200
    memory: 2000GB
    processes: 200
    interface: ib0
    queue: devel
    resource_spec: 'select=25:ncpus=8:model=has'
    walltime: "00:30:00"
```
Then, the user would call `cluster.scale(1)` once and be done. This satisfies the one-job restriction of the high-availability queue, reduces wait times since you have asked for all resources up front, and is also more “friendly” to the scheduler as it does not involve submitting many jobs. This pattern of course does not permit any scaling up or down, but that is a separate issue.
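For comparison, the single-job pattern looks the same from the Python side, just scaled to one job (again a sketch, assuming the devel-queue config above is in place):

```python
# Sketch of the single-job pattern (assumes the devel-queue config above).
from dask.distributed import Client
from dask_jobqueue import PBSCluster

cluster = PBSCluster()    # reads queue: devel, 25-node resource_spec, 200 processes
cluster.scale(1)          # a single PBS job requesting all 25 nodes
client = Client(cluster)
# Note: as far as I can tell, the generated job script runs a single
# dask-worker command, so it is not obvious to me how those 200 processes
# get spread across the 25 nodes in the allocation.
```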
This is quite different from the multi-job workflow I’ve used previously (and the one that seems to be recommended in the dask-jobqueue docs), and I’m trying to wrap my head around whether this makes sense. My main questions are:
- Does this configuration pattern make sense in the context of dask-jobqueue, and are there any disadvantages (other than scalability)? The single-job cluster seems like an anti-pattern to me, but I certainly understand why it is preferable from the HPC admin perspective.
- Which node is this single Dask worker running on?
- How is work being parallelized across multiple nodes if only a single job is running?
I’ve experimented with the above single-job workflow (and minor variations on it) and have found that computations (which worked just fine in the multi-job context) will lock up and/or result in a killed worker. However, it is not entirely clear to me why this is happening.
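For what it’s worth, the way I have been checking which node(s) the workers actually land on is roughly the following (a hypothetical snippet; `cluster` is the `PBSCluster` from either configuration above):

```python
import socket
from dask.distributed import Client

client = Client(cluster)               # cluster from either config above
# Run gethostname on every worker: returns {worker address: node hostname}
print(client.run(socket.gethostname))
```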
I apologize for the lengthy post! I’m trying to get a sense of the optimal usage pattern here given the many different configuration options, and trying to wrap my head around how all of this actually works. Any advice would be extremely helpful!
Top GitHub Comments
FYI I changed the title to reflect the discussion. Feel free to edit it or suggest a better title!
@guillaumeeb @lesteve Thanks for all of your help on this. I think we have a clearer picture of how to proceed with our cluster configuration on Pleiades. I’m going to close this (again!) as we seem to have resolved our main issue, but will reopen if we run into more problems.