Communication issue between scheduler and worker
First off: apologies if this whole description is somewhat vague - I'm on a bit of a trial-and-error mission to hook up a scheduler service (running in a VM) with workers sitting on a development HPC (to eventually be auto-scaled through PBS).
Since I don't yet have control over the firewall, I'm currently reverse-tunnelling from my scheduler machine to the HPC (only dealing with login nodes for now, so none of the extra network complexity of reaching compute nodes etc.). The reverse tunnel (run from "scheduler-dev") looks something like:
ssh -N -R 9789:localhost:9789 HPC-dev
I have started up a scheduler (on “scheduler-dev”) with:
dask-scheduler --port 9789 --http-port 9790 --bokeh-port 9791
And workers can therefore connect to it from “HPC-dev” with:
dask-worker localhost:9789
So far so good. I originally (stupidly) had the scheduler at v1.14 and the worker at v1.16, and was getting errors along the lines of:
distributed.utils - WARNING - Could not resolve hostname: tcp
Traceback (most recent call last):
File "...distributed/utils.py", line 259, in ensure_ip
return socket.gethostbyname(hostname)
gaierror: [Errno -2] Name or service not known
distributed.utils - ERROR - [Errno -2] Name or service not known
Which I fixed by doing something like the following in distributed.scheduler:
if addr.startswith('tcp://'): addr = addr[6:]
(I’m not going to lie to you, I’m in hack mode and am trying to establish feasibility/PoC, rather than productionize this 😉)
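For reference, here is a minimal sketch of that workaround; the helper name is hypothetical, the actual change was an inline hack inside distributed.scheduler:

```python
import socket

def resolve_host(addr):
    # Strip a leading 'tcp://' so socket.gethostbyname() sees a bare
    # hostname rather than the full address with its scheme.
    if addr.startswith('tcp://'):
        addr = addr[len('tcp://'):]
    # Drop a trailing ':port' if one is present before resolving.
    host = addr.rsplit(':', 1)[0] if ':' in addr else addr
    return socket.gethostbyname(host)

# e.g. resolve_host('tcp://127.0.0.1:33562') -> '127.0.0.1'
```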
Daftness resolved… I updated my scheduler to v1.16 and dodged this issue altogether. However, it seems that some of the messages coming from the worker through the tunnel arrive at the scheduler with byte-string keys and values rather than the expected strings. Some debug logs to demonstrate:
distributed.core - DEBUG - Connection from 'tcp://127.0.0.1:55537' to Scheduler
distributed.core - DEBUG - Connection from 'tcp://127.0.0.1:55538' to Scheduler
distributed.core - DEBUG - Connection from 'tcp://127.0.0.1:55539' to Scheduler
distributed.core - DEBUG - Message from 'tcp://127.0.0.1:55539': {'op': 'feed', 'setup': b'\x80\x04\x957\x00\x00\x00\x00\x00\x00\x00\x8c#distributed.diagnostics.eventstream\x94\x8c\x0bEventStream\x94\x93\x94.', 'function': b'\x80\x04\x957\x00\x00\x00\x00\x00\x00\x00\x8c#distributed.diagnostics.eventstream\x94\x8c\x0bswap_buffer\x94\x93\x94.', 'interval': 0.1, 'teardown': b'\x80\x04\x954\x00\x00\x00\x00\x00\x00\x00\x8c#distributed.diagnostics.eventstream\x94\x8c\x08teardown\x94\x93\x94.'}
distributed.core - DEBUG - Calling into handler feed
distributed.core - DEBUG - Message from 'tcp://127.0.0.1:55538': {'op': 'feed', 'setup': b'\x80\x04\x954\x00\x00\x00\x00\x00\x00\x00\x8c distributed.diagnostics.progress\x94\x8c\x0bAllProgress\x94\x93\x94.', 'function': b"\x80\x04\x956\x00\x00\x00\x00\x00\x00\x00\x8c'distributed.diagnostics.progress_stream\x94\x8c\x06counts\x94\x93\x94.", 'interval': 0.05, 'teardown': b'\x80\x04\x955\x00\x00\x00\x00\x00\x00\x00\x8c\x15distributed.scheduler\x94\x8c\x17Scheduler.remove_plugin\x94\x93\x94.'}
distributed.core - DEBUG - Calling into handler feed
distributed.core - DEBUG - Message from 'tcp://127.0.0.1:55537': {'op': 'feed', 'function': b'\x80\x04\x954\x00\x00\x00\x00\x00\x00\x00\x8c!distributed.diagnostics.scheduler\x94\x8c\nprocessing\x94\x93\x94.', 'interval': 0.2}
distributed.core - DEBUG - Calling into handler feed
distributed.core - DEBUG - Connection from 'tcp://127.0.0.1:55636' to Scheduler
distributed.core - DEBUG - Message from 'tcp://127.0.0.1:55636': {b'memory_limit': 162231150182.0, b'name': b'tcp://127.0.0.1:33562', b'keys': [], b'host_info': {b'time': 1488259979.092472, b'network-send': 0, b'disk-write': 0, b'memory_percent': 23.9, b'disk-read': 0, b'memory': 270385250304, b'cpu': 3.4, b'network-recv': 0}, b'local_directory': b'/tmp/nanny-mbS9pp', b'services': {b'bokeh': 8789, b'http': 35531, b'nanny': 32857}, b'nbytes': {}, b'address': b'tcp://127.0.0.1:33562', b'ncores': 20, b'reply': True, b'now': 1488259979.092471, b'resources': {}, b'op': b'register'}
distributed.core - WARNING - No handler found: b'register'
Traceback (most recent call last):
File "...distributed/core.py", line 252, in handle_comm
handler = self.handlers[op]
KeyError: b'register'
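For illustration, this boils down to a str-vs-bytes key mismatch: the scheduler looks the op up in a str-keyed handler table, so a bytes key can never match. The dict below is made up for the example and is not distributed's internals:

```python
# Toy reproduction of the KeyError above: a str-keyed handler table
# never matches a bytes key.
handlers = {'register': lambda msg: 'ok'}

op = b'register'            # key as it arrives from the mismatched worker
op in handlers              # False -> KeyError on lookup

# A crude normalisation (the sort of "tweak" alluded to below) would be:
key = op.decode() if isinstance(op, bytes) else op
handlers[key]               # now resolves to the register handler
```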
I can reproduce this issue consistently with this setup. I have a few tweaks up my sleeve to iron these issues out, but would love to avoid them if possible.
Is this something you’ve seen before? Am I missing something obvious?
Top GitHub Comments
Note that client.get_versions(check=True) is a handy way to verify a consistent environment.
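For example, from a client session (the address below is just the tunnelled scheduler endpoint from this issue):

```python
from distributed import Client

# Connect through the tunnelled scheduler endpoint used above.
client = Client('localhost:9789')

# With check=True this raises if the client, scheduler, and workers
# report mismatched package versions; with check=False it only reports them.
client.get_versions(check=True)
```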
The fact that this traceback is now indexed is good enough for me. Thanks for the confirmation @pitrou.