Using different interfaces for scheduler and workers
I am currently trying to set up a jobqueue cluster in an HPC environment in which the available network interfaces on the compute nodes (where the Dask workers live) and the login nodes (where the scheduler lives) are not the same.
Available network interfaces

- on the compute nodes there are `lo` and `ib0`, but no `ib1` and no `eth{0,1}`
- whereas on the login nodes there are `lo`, `ib1`, `eth0`, and `eth1`
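To confirm what each node exposes, the interface list can be queried with `psutil`, which is also what `distributed.utils.get_ip_interface` consults in the traceback below. A minimal sketch; the commented output reflects this cluster's interfaces:

```python
import psutil

# List the network interfaces psutil sees on this node; run once on a
# login node and once inside a compute job to compare the two environments.
print(sorted(psutil.net_if_addrs()))
# login node:   ['eth0', 'eth1', 'ib1', 'lo']
# compute node: ['ib0', 'lo']
```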
Starting a jobqueue cluster with a basic workflow of
```python
import dask_jobqueue
import dask.distributed as dask_distributed

jobqueue_cluster = dask_jobqueue.SLURMCluster(cores=6, memory='24GB',
                                              project='esmtst', queue='devel',
                                              interface='ib0')
client = dask_distributed.Client(jobqueue_cluster)
jobqueue_cluster.scale(jobs=1)
```
causes the following error:

```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-2-b93ee7f2ce8e> in <module>
----> 1 jobqueue_cluster = dask_jobqueue.SLURMCluster(cores=6, memory='24GB',
      2                                                project='esmtst', queue='devel',
      3                                                interface='ib0')

/p/project/cesmtst/hoeflich1/miniconda3/envs/Dask-jobqueue_v2020.02.10/lib/python3.8/site-packages/dask_jobqueue/core.py in __init__(self, n_workers, job_cls, loop, security, silence_logs, name, asynchronous, interface, host, protocol, dashboard_address, config_name, **kwargs)
    446             worker["group"] = ["-" + str(i) for i in range(kwargs["processes"])]
    447
--> 448         self._dummy_job  # trigger property to ensure that the job is valid
    449
    450         super().__init__(

/p/project/cesmtst/hoeflich1/miniconda3/envs/Dask-jobqueue_v2020.02.10/lib/python3.8/site-packages/dask_jobqueue/core.py in _dummy_job(self)
    473         except AttributeError:
    474             address = "tcp://<insert-scheduler-address-here>:8786"
--> 475         return self.job_cls(
    476             address or "tcp://<insert-scheduler-address-here>:8786",
    477             name="name",

/p/project/cesmtst/hoeflich1/miniconda3/envs/Dask-jobqueue_v2020.02.10/lib/python3.8/site-packages/dask_jobqueue/slurm.py in __init__(self, queue, project, walltime, job_cpu, job_mem, job_extra, config_name, *args, **kwargs)
     39             job_extra = dask.config.get("jobqueue.%s.job-extra" % config_name)
     40
---> 41         super().__init__(*args, config_name=config_name, **kwargs)
     42
     43         header_lines = []

/p/project/cesmtst/hoeflich1/miniconda3/envs/Dask-jobqueue_v2020.02.10/lib/python3.8/site-packages/dask_jobqueue/core.py in __init__(self, scheduler, name, cores, memory, processes, nanny, interface, death_timeout, local_directory, extra, env_extra, header_skip, log_directory, shebang, python, job_name, config_name, **kwargs)
    195         if interface:
    196             extra = extra + ["--interface", interface]
--> 197             kwargs.setdefault("host", get_ip_interface(interface))
    198         else:
    199             kwargs.setdefault("host", "")

/p/project/cesmtst/hoeflich1/miniconda3/envs/Dask-jobqueue_v2020.02.10/lib/python3.8/site-packages/distributed/utils.py in get_ip_interface(ifname)
    181     if ifname not in net_if_addrs:
    182         allowed_ifnames = list(net_if_addrs.keys())
--> 183         raise ValueError(
    184             "{!r} is not a valid network interface. "
    185             "Valid network interfaces are: {}".format(ifname, allowed_ifnames)

ValueError: 'ib0' is not a valid network interface. Valid network interfaces are: ['lo', 'ib1', 'eth0', 'eth1']
```
The workaround suggested in https://github.com/dask/dask-jobqueue/issues/207#issuecomment-566397581 doesn't work, because Dask distributed doesn't accept `interface` and `host` being set at the same time. (As the traceback shows, `SLURMCluster` resolves the single `interface` argument on the login node via `get_ip_interface` to derive the scheduler host, and also forwards it to the workers, so there is currently no way to give scheduler and workers different interfaces.)
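For reference, the `scheduler_options` keyword discussed around #384 would allow giving the scheduler its own interface. A minimal sketch, assuming a dask-jobqueue version that accepts `scheduler_options`; untested against the 0.7.0 release used above:

```python
import dask_jobqueue

# Sketch only: assumes a dask-jobqueue version that supports the
# ``scheduler_options`` keyword discussed in #384.
cluster = dask_jobqueue.SLURMCluster(
    cores=6,
    memory='24GB',
    project='esmtst',
    queue='devel',
    interface='ib0',                         # workers bind to ib0 on the compute nodes
    scheduler_options={'interface': 'ib1'},  # scheduler binds to ib1 on the login node
)
```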
Top GitHub Comments
Not sure what you mean by “violated scopes”, but there is certainly room for improvement, and help on `dask-jobqueue` would be more than welcome indeed! Something you have to bear in mind is that testing `dask-jobqueue` is not trivial: each cluster has its own quirks, so sometimes there is code that is not tested but still useful. If you have suggestions on how to improve testing, those would be more than welcome as well!

We already have a `docker-compose` setup to test `dask-jobqueue` on Travis (see https://github.com/dask/dask-jobqueue/tree/master/ci for more details), which is why I was mentioning `docker-compose`. I am guessing (but I don't know for sure) that adding a network interface to the worker Docker image that does not exist on the scheduler Docker image is not that hard.

---

That would make sense. I am not planning to include it in #384 to keep it simple, but I would welcome a PR on this! My thoughts on this: probably this should be nested inside the `slurm` section in `~/.config/dask/jobqueue.yaml`, i.e. something like the sketch below; `dask.config` may just be able to tackle this just fine.
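A hypothetical sketch of the nesting described above, expressed through `dask.config`; the `scheduler-options` key is an assumption here, not a confirmed option of the release this issue was filed against:

```python
import dask.config

# Hypothetical sketch of the proposed nesting; the "scheduler-options"
# key is an assumption, not a documented option of the issue-era release.
dask.config.set({
    "jobqueue.slurm.interface": "ib0",                         # worker interface
    "jobqueue.slurm.scheduler-options": {"interface": "ib1"},  # scheduler interface
})
```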
---

I actually think it would be a little bit of overkill to debug this with a Docker compose setup. If it is helpful and if you don't mind (and as I indeed have capacity for such tasks this week!), I could try to develop a simple pytest solution that covers the network interface setup of my HPC cluster, something along the lines of the sketch below.
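A minimal sketch of such a pytest check, faking `psutil.net_if_addrs` to mimic a login node that lacks the compute-node interface. The test itself and the fixture values are assumptions; it exercises `distributed.utils.get_ip_interface`, the function that raised in the traceback above:

```python
import socket
from types import SimpleNamespace

import psutil
import pytest
from distributed.utils import get_ip_interface


def fake_net_if_addrs():
    # Mimic a login node that exposes ib1/eth0/eth1 but not ib0.
    addr = SimpleNamespace(family=socket.AF_INET, address="10.0.0.1")
    return {name: [addr] for name in ("lo", "ib1", "eth0", "eth1")}


def test_missing_worker_interface_raises(monkeypatch):
    # get_ip_interface consults psutil.net_if_addrs(), as seen in the traceback.
    monkeypatch.setattr(psutil, "net_if_addrs", fake_net_if_addrs)
    with pytest.raises(ValueError, match="not a valid network interface"):
        get_ip_interface("ib0")  # exists only on the compute nodes
```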
---

Actually, I have the feeling (at least for the 0.7.0 code; I haven't had the time to do this for the code in #384) that what I would call “scopes” currently are, or might be, “violated” in several places, and that some things could be cleaned up in that regard. I could try to come up with a list of suggestions on that matter? (For which a rather fresh perspective might be helpful.)