
Dask with jobqueue not using multiple nodes

See original GitHub issue

I am trying to use Dask to do parallel processing across multiple nodes on supercomputing resources, yet the Dask-distributed map only takes advantage of one of the nodes. Note that I put this up on Stack Overflow but it didn't get any attention, so now I'm giving it a go here.

Here is a test script I am using to set up the client and perform a simple operation:

import time
from distributed import Client
from dask_jobqueue import SLURMCluster
from socket import gethostname


def slow_increment(x):
    time.sleep(10)
    return [x + 1, gethostname(), time.time()]


cluster = SLURMCluster(
    queue='somequeue',
    cores=2,
    memory='128GB',
    project='someproject',
    walltime='00:05:00',
    job_extra=['-o myjob.%j.%N.out',
               '-e myjob.%j.%N.error'],
    env_extra=['export I_MPI_FABRICS=dapl',
               'source activate dask-jobqueue'])

cluster.scale(2)

client = Client(cluster)

A = client.map(slow_increment, range(8))
B = client.gather(A)

print(client)

for res in B:
    print(res)

client.close()

And here is the output:

<Client: scheduler='tcp://someip' processes=2 cores=4>
[1, 'bdw-0478', 1540477582.6744401]
[2, 'bdw-0478', 1540477582.67487]
[3, 'bdw-0478', 1540477592.68666]
[4, 'bdw-0478', 1540477592.6879778]
[5, 'bdw-0478', 1540477602.6986163]
[6, 'bdw-0478', 1540477602.6997452]
[7, 'bdw-0478', 1540477612.7100565]
[8, 'bdw-0478', 1540477612.711296]

While the printed client info indicates that Dask has the correct number of nodes (processes) and tasks per node (cores), the socket.gethostname() output and timestamps show that the second node is never used. I do know that dask-jobqueue successfully requested two nodes, and that both jobs complete at the same time. I tried different MPI fabrics for inter- and intra-node communication (e.g. tcp, shm:tcp, shm:ofa, ofa, ofi, dapl), but this did not change the result. I also tried removing the "export I_MPI_FABRICS" command and using the "interface" option instead, but that caused the code to hang.
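
For what it's worth, a quick way to confirm that both SLURM jobs actually registered as workers before mapping any tasks is to wait for them explicitly and ask each worker for its hostname (a minimal sketch reusing the client and gethostname from the script above):

# Block until both requested workers have connected, then ask every worker
# for its hostname; client.run returns a dict keyed by worker address.
client.wait_for_workers(2)
print(client.run(gethostname))

If only one hostname appears, the second job never joined the scheduler. Waiting for both workers before calling client.map can also rule out the case where all tasks are handed to the only worker that is up at submission time.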

Thanks in advance for any assistance.

-Noah

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 58 (32 by maintainers)

Top GitHub Comments

1 reaction
ocaisa commented, Nov 26, 2022

@appassionate We created a library that does what you are asking for but I admit it requires quite a bit of configuration since you need to tell it about the system and how you launch MPI jobs there. The library is at https://github.com/E-CAM/jobqueue_features and there’s a tutorial at https://github.com/E-CAM/jobqueue_features_workshop_materials (and you can find a recording of the tutorial at https://www.youtube.com/watch?v=FpMua8iJeTk&ab_channel=E-CAM).

I haven't touched it in a few months, so I need to check whether our CI is still passing. The package does work with the latest version of dask-jobqueue (0.8.1).

0 reactions
appassionate commented, Dec 2, 2022

> @appassionate, with your comment, it is not really clear to me what you're trying to achieve.
>
> Anyway, if @ocaisa's answer suits you, this is perfect; if not, I encourage you to open a new issue and try to make your issue a bit clearer.

Thanks for your suggestion! jobqueue_features customizes SLURMCluster for MPI use cases; I believe it covers scenarios such as running across more nodes in Slurm, which is what I need.


Top Results From Across the Web

  • Issue with dask.distributed in multiple nodes of a cluster: Hi all! I am working in a cluster with 3 nodes/machines. ... If i am not mistaken the Dask-jobqueue is for centrally assigning...
  • Configure Dask-Jobqueue: Cores and Memory. These numbers correspond to the size of a single job, which is typically the size of a single node on...
  • python - Dask: Jobs on multiple nodes with one worker, run ...: The values you give to the Dask Jobqueue constructors are the values for a single job for a single node. So here you...
  • Parallel processing with Dask - MintPy - Read the Docs: We have tested two types of clusters: local cluster: on a single machine (laptop or computing node) with multiple CPU cores, suitable for...
  • Dask Jobqueue - IPSL ESPRI MESO User documentation: Dask-Jobqueue defines the concept of cluster ... If you use too many workers some may not have enough to do...
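
The recurring point in these results is that the cores and memory given to a dask-jobqueue cluster describe a single job, which is typically a single node, while scale() controls how many such jobs are submitted. A minimal sketch along those lines (the parameter values are illustrative assumptions, not taken from the issue):

from distributed import Client
from dask_jobqueue import SLURMCluster

# Each job describes ONE node; adjust cores/memory to match the node type.
cluster = SLURMCluster(
    queue='somequeue',      # placeholder queue name, as in the issue
    cores=36,               # cores available on one node (assumed value)
    processes=4,            # split each job into 4 worker processes
    memory='128GB',         # memory of one node
    walltime='00:05:00',
)

cluster.scale(jobs=2)       # submit two jobs -> two nodes -> eight workers in total

client = Client(cluster)
client.wait_for_workers(8)  # make sure every worker has connected before mapping

With that layout, client.map has workers on both nodes available from the start, rather than only whichever job happened to start first.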
