
Using different interfaces for scheduler and workers

See original GitHub issue

I am currently trying to set up a jobqueue cluster in an HPC environment in which the available network interfaces on the compute nodes (where the Dask workers live) and on the login nodes (where the scheduler lives) are not the same.

Available network interfaces

  • on the compute nodes there are lo and ib0, but no ib1 and no eth{0,1}
  • whereas on the login nodes there are lo, ib1, eth0, and eth1
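
A quick way to confirm the mismatch is to list the interfaces that psutil reports on each node, since that is exactly what distributed’s get_ip_interface() consults (a minimal diagnostic sketch; the commented output reflects the interface lists above):

import psutil

# List the network interfaces psutil can see on this node; distributed's
# get_ip_interface() raises a ValueError for any name missing from this dict.
print(sorted(psutil.net_if_addrs()))
# login node:   ['eth0', 'eth1', 'ib1', 'lo']
# compute node: ['ib0', 'lo']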

Starting a jobqueue cluster with a basic workflow of

import dask_jobqueue
import dask.distributed as dask_distributed
jobqueue_cluster = dask_jobqueue.SLURMCluster(cores=6, memory='24GB',
                                              project='esmtst', queue='devel',
                                              interface='ib0')
client = dask_distributed.Client(jobqueue_cluster)
jobqueue_cluster.scale(jobs=1)

causes the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-2-b93ee7f2ce8e> in <module>
----> 1 jobqueue_cluster = dask_jobqueue.SLURMCluster(cores=6, memory='24GB',
      2                                               project='esmtst', queue='devel',
      3                                               interface='ib0')

/p/project/cesmtst/hoeflich1/miniconda3/envs/Dask-jobqueue_v2020.02.10/lib/python3.8/site-packages/dask_jobqueue/core.py in __init__(self, n_workers, job_cls, loop, security, silence_logs, name, asynchronous, interface, host, protocol, dashboard_address, config_name, **kwargs)
    446             worker["group"] = ["-" + str(i) for i in range(kwargs["processes"])]
    447
--> 448         self._dummy_job  # trigger property to ensure that the job is valid
    449
    450         super().__init__(

/p/project/cesmtst/hoeflich1/miniconda3/envs/Dask-jobqueue_v2020.02.10/lib/python3.8/site-packages/dask_jobqueue/core.py in _dummy_job(self)
    473         except AttributeError:
    474             address = "tcp://<insert-scheduler-address-here>:8786"
--> 475         return self.job_cls(
    476             address or "tcp://<insert-scheduler-address-here>:8786",
    477             name="name",

/p/project/cesmtst/hoeflich1/miniconda3/envs/Dask-jobqueue_v2020.02.10/lib/python3.8/site-packages/dask_jobqueue/slurm.py in __init__(self, queue, project, walltime, job_cpu, job_mem, job_extra, config_name, *args, **kwargs)
     39             job_extra = dask.config.get("jobqueue.%s.job-extra" % config_name)
     40
---> 41         super().__init__(*args, config_name=config_name, **kwargs)
     42
     43         header_lines = []

/p/project/cesmtst/hoeflich1/miniconda3/envs/Dask-jobqueue_v2020.02.10/lib/python3.8/site-packages/dask_jobqueue/core.py in __init__(self, scheduler, name, cores, memory, processes, nanny, interface, death_timeout, local_directory, extra, env_extra, header_skip, log_directory, shebang, python, job_name, config_name, **kwargs)
    195         if interface:
    196             extra = extra + ["--interface", interface]
--> 197             kwargs.setdefault("host", get_ip_interface(interface))
    198         else:
    199             kwargs.setdefault("host", "")

/p/project/cesmtst/hoeflich1/miniconda3/envs/Dask-jobqueue_v2020.02.10/lib/python3.8/site-packages/distributed/utils.py in get_ip_interface(ifname)
    181     if ifname not in net_if_addrs:
    182         allowed_ifnames = list(net_if_addrs.keys())
--> 183         raise ValueError(
    184             "{!r} is not a valid network interface. "
    185             "Valid network interfaces are: {}".format(ifname, allowed_ifnames)

ValueError: 'ib0' is not a valid network interface. Valid network interfaces are: ['lo', 'ib1', 'eth0', 'eth1']

The workaround suggested in https://github.com/dask/dask-jobqueue/issues/207#issuecomment-566397581 doesn’t work, because Dask distributed doesn’t allow interface and host to be set at the same time.
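
For reference, the direction discussed in the comments below (and worked on in #384) is to give the scheduler its own networking options, separate from the workers’ interface. A minimal sketch of how that might look with the scheduler_options keyword proposed there (not available in 0.7.0, so treat this as an assumption about the eventual API):

import dask_jobqueue

# Workers bind to ib0 (present on the compute nodes), while the scheduler,
# which runs on the login node, gets its own interface via scheduler_options.
cluster = dask_jobqueue.SLURMCluster(
    cores=6, memory='24GB',
    project='esmtst', queue='devel',
    interface='ib0',                         # worker-side interface
    scheduler_options={'interface': 'ib1'},  # scheduler-side interface
)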

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 26 (26 by maintainers)

Top GitHub Comments

1 reaction
lesteve commented, Mar 11, 2020

> Actually, I have the feeling (at least for the 0.7.0 code, I haven’t had the time to do this for the code in #384) that currently what I would call “scopes” are, or might be, “violated” in several places, and that some things could be cleaned up in that regard. I could try to come up with a list of suggestions on that matter? (A rather fresh perspective might be helpful for that.)

Not sure what you mean by “violated scopes”, but there is certainly room for improvement, and help on dask-jobqueue would be more than welcome indeed! Something you have to bear in mind is that testing dask-jobqueue is not trivial; each cluster has its own quirks, so sometimes there is code that is not tested but is useful. If you have suggestions on how to improve testing, those would be more than welcome as well!

> I actually think it would be a bit of overkill to debug this with a Docker Compose setup.

We already have a docker-compose setup to test dask-jobqueue on Travis (see https://github.com/dask/dask-jobqueue/tree/master/ci for more details), which is why I was mentioning docker-compose. I am guessing (but I don’t know for sure) that adding a network interface to the worker docker image that does not exist on the scheduler docker image is not that hard.
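
A rough sketch of how that could look in a docker-compose file (the service and network names below are invented for illustration, not the actual ones in ci/): attaching the worker service to a second network gives its container an extra interface that the scheduler container does not have:

services:
  scheduler:
    networks:
      - common
  slurm-worker:
    networks:
      - common
      - worker-only   # appears as an extra NIC only inside the worker container
networks:
  common:
  worker-only: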

> Also, it would be really nice if the new scheduler options could be set in a configuration file.

That would make sense. I am not planning to include it in #384, to keep things simple, but I would welcome a PR on this! My thoughts on this: it should probably be nested inside the slurm section in ~/.config/dask/jobqueue.yaml, i.e. something like:

jobqueue:
  slurm:
    name: dask-worker
    walltime: '00:30:00'
    ...

    scheduler_options:
      interface: eth1
      dashboard_address: :8787

dask.config may well be able to handle this just fine.
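
For what it’s worth, a minimal sketch of reading such a nested block back out, assuming the YAML above lands in ~/.config/dask/jobqueue.yaml and the key keeps the name scheduler_options:

import dask

# Read the nested scheduler options out of the jobqueue config,
# falling back to an empty dict if the block is absent.
scheduler_options = dask.config.get('jobqueue.slurm.scheduler_options', default={})
print(scheduler_options)  # e.g. {'interface': 'eth1', 'dashboard_address': ':8787'}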

1 reaction
kathoef commented, Mar 11, 2020

> I need to think more about how to find a workaround for this … it may be worth it for me to try to reproduce the problem in a docker-compose setup so that I can debug more easily.

I actually think it would be a bit of overkill to debug this with a Docker Compose setup. If it is helpful, and if you don’t mind (and I do indeed have capacity for such tasks this week!), I could try to develop a simple pytest solution that covers the network interface setup of my HPC cluster.
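
For concreteness, a sketch of what such a test might look like (the test name is hypothetical; it monkeypatches psutil so that the interface lookup sees a login-node-style NIC list):

import psutil
import pytest

from distributed.utils import get_ip_interface

def test_missing_worker_interface(monkeypatch):
    # Pretend this process only sees the login node's interfaces.
    fake_ifaces = {name: [] for name in ('lo', 'ib1', 'eth0', 'eth1')}
    monkeypatch.setattr(psutil, 'net_if_addrs', lambda: fake_ifaces)
    # Asking for the compute-node interface should fail exactly as in the report.
    with pytest.raises(ValueError, match='not a valid network interface'):
        get_ip_interface('ib0')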

Actually, I have the feeling (at least for the 0.7.0 code, I haven’t had the time to do this for the code in #384) that currently what I would call “scopes” are, or might be, “violated” in several places, and that some things could be cleaned up in that regard. I could try to come up with a list of suggestions on that matter? (A rather fresh perspective might be helpful for that.)

Read more comments on GitHub.

