Communication issue between scheduler and worker
First off: apologies if this whole description is somewhat vague - I'm on a bit of a trial-and-error mission to hook up a scheduler service (running in a VM) with workers sitting on a development HPC (to eventually be auto-scaled through PBS).
Since I don't yet have control over the firewall, I'm currently reverse-tunnelling from my scheduler machine to the HPC (only dealing with login nodes for now, so none of the extra network complexity of reaching compute nodes etc.). The reverse tunnel (run from "scheduler-dev") looks something like:
ssh -N -R 9789:localhost:9789 HPC-dev
I have started up a scheduler (on “scheduler-dev”) with:
dask-scheduler --port 9789 --http-port 9790 --bokeh-port 9791
And workers can therefore connect to it from “HPC-dev” with:
dask-worker localhost:9789
So far so good. I originally (stupidly) had the scheduler at v1.14 and the worker at v1.16, and was getting errors along the lines of:
distributed.utils - WARNING - Could not resolve hostname: tcp
Traceback (most recent call last):
File "...distributed/utils.py", line 259, in ensure_ip
return socket.gethostbyname(hostname)
gaierror: [Errno -2] Name or service not known
distributed.utils - ERROR - [Errno -2] Name or service not known
Which I fixed by doing something like the following in distributed.scheduler:
if addr.startswith('tcp://'): addr = addr[6:]
(I’m not going to lie to you, I’m in hack mode and am trying to establish feasibility/PoC, rather than productionize this 😉)
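For reference, here is a minimal sketch of that workaround; the helper name is hypothetical, the actual change was an inline hack inside distributed.scheduler:

```python
import socket

def resolve_host(addr):
    # Strip a leading 'tcp://' so socket.gethostbyname() sees a bare
    # hostname rather than the full address with its scheme.
    if addr.startswith('tcp://'):
        addr = addr[len('tcp://'):]
    # Drop a trailing ':port' if one is present before resolving.
    host = addr.rsplit(':', 1)[0] if ':' in addr else addr
    return socket.gethostbyname(host)

# e.g. resolve_host('tcp://127.0.0.1:33562') -> '127.0.0.1'
```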
Daftness resolved… I updated my scheduler to v1.16 and dodged this issue altogether. However, it seems that some of the messages coming from the worker through the tunnel arrive at the scheduler with byte-string keys and values rather than the expected strings. Some debug logs to demonstrate:
distributed.core - DEBUG - Connection from 'tcp://127.0.0.1:55537' to Scheduler
distributed.core - DEBUG - Connection from 'tcp://127.0.0.1:55538' to Scheduler
distributed.core - DEBUG - Connection from 'tcp://127.0.0.1:55539' to Scheduler
distributed.core - DEBUG - Message from 'tcp://127.0.0.1:55539': {'op': 'feed', 'setup': b'\x80\x04\x957\x00\x00\x00\x00\x00\x00\x00\x8c#distributed.diagnostics.eventstream\x94\x8c\x0bEventStream\x94\x93\x94.', 'function': b'\x80\x04\x957\x00\x00\x00\x00\x00\x00\x00\x8c#distributed.diagnostics.eventstream\x94\x8c\x0bswap_buffer\x94\x93\x94.', 'interval': 0.1, 'teardown': b'\x80\x04\x954\x00\x00\x00\x00\x00\x00\x00\x8c#distributed.diagnostics.eventstream\x94\x8c\x08teardown\x94\x93\x94.'}
distributed.core - DEBUG - Calling into handler feed
distributed.core - DEBUG - Message from 'tcp://127.0.0.1:55538': {'op': 'feed', 'setup': b'\x80\x04\x954\x00\x00\x00\x00\x00\x00\x00\x8c distributed.diagnostics.progress\x94\x8c\x0bAllProgress\x94\x93\x94.', 'function': b"\x80\x04\x956\x00\x00\x00\x00\x00\x00\x00\x8c'distributed.diagnostics.progress_stream\x94\x8c\x06counts\x94\x93\x94.", 'interval': 0.05, 'teardown': b'\x80\x04\x955\x00\x00\x00\x00\x00\x00\x00\x8c\x15distributed.scheduler\x94\x8c\x17Scheduler.remove_plugin\x94\x93\x94.'}
distributed.core - DEBUG - Calling into handler feed
distributed.core - DEBUG - Message from 'tcp://127.0.0.1:55537': {'op': 'feed', 'function': b'\x80\x04\x954\x00\x00\x00\x00\x00\x00\x00\x8c!distributed.diagnostics.scheduler\x94\x8c\nprocessing\x94\x93\x94.', 'interval': 0.2}
distributed.core - DEBUG - Calling into handler feed
distributed.core - DEBUG - Connection from 'tcp://127.0.0.1:55636' to Scheduler
distributed.core - DEBUG - Message from 'tcp://127.0.0.1:55636': {b'memory_limit': 162231150182.0, b'name': b'tcp://127.0.0.1:33562', b'keys': [], b'host_info': {b'time': 1488259979.092472, b'network-send': 0, b'disk-write': 0, b'memory_percent': 23.9, b'disk-read': 0, b'memory': 270385250304, b'cpu': 3.4, b'network-recv': 0}, b'local_directory': b'/tmp/nanny-mbS9pp', b'services': {b'bokeh': 8789, b'http': 35531, b'nanny': 32857}, b'nbytes': {}, b'address': b'tcp://127.0.0.1:33562', b'ncores': 20, b'reply': True, b'now': 1488259979.092471, b'resources': {}, b'op': b'register'}
distributed.core - WARNING - No handler found: b'register'
Traceback (most recent call last):
File "...distributed/core.py", line 252, in handle_comm
handler = self.handlers[op]
KeyError: b'register'
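For illustration, this boils down to a str-vs-bytes key mismatch: the scheduler looks the op up in a str-keyed handler table, so a bytes key can never match. The dict below is made up for the example and is not distributed's internals:

```python
# Toy reproduction of the KeyError above: a str-keyed handler table
# never matches a bytes key.
handlers = {'register': lambda msg: 'ok'}

op = b'register'            # key as it arrives from the mismatched worker
op in handlers              # False -> KeyError on lookup

# A crude normalisation (the sort of "tweak" alluded to below) would be:
key = op.decode() if isinstance(op, bytes) else op
handlers[key]               # now resolves to the register handler
```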
I can reproduce this issue consistently with this setup. I have a few tweaks up my sleeve to iron these issues out, but would love to avoid them if possible.
Is this something you’ve seen before? Am I missing something obvious?
Top GitHub Comments
Note that client.get_versions(check=True) is a handy way to verify a consistent environment.
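For example, from a client session (the address below is just the tunnelled scheduler endpoint from this issue):

```python
from distributed import Client

# Connect through the tunnelled scheduler endpoint used above.
client = Client('localhost:9789')

# With check=True this raises if the client, scheduler, and workers
# report mismatched package versions; with check=False it only reports them.
client.get_versions(check=True)
```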
The fact that this traceback is now indexed is good enough for me. Thanks for the confirmation @pitrou.