Unusual behaviour when scheduler cannot route to worker
While debugging an unrelated issue I’ve found some strange behaviour when a worker connects to a scheduler but the scheduler is not able to route back to the worker’s address.
I expect this may happen to users trying to run in containers, in the cloud, or generally in unusual network conditions.
Reproduce with Docker for Desktop
We can reproduce this using Docker for Desktop. By default it is not possible for a host machine (my MacBook) to route to containers inside the container network. However it is possible for containers to route to IP addresses on the LAN, so they can connect to services on my laptop.
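To confirm the asymmetry from the host side, you can try opening a TCP connection to an address inside the container network. This is only a sketch; 172.17.0.2 and the port are placeholders for whatever address a container worker reports (see the worker logs below).
import socket

# Placeholder container address; substitute the address your worker reports.
addr = ("172.17.0.2", 43861)

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.settimeout(2)
try:
    sock.connect(addr)
    print("reachable from the host")
except OSError as exc:
    # Expected on Docker for Desktop: the host cannot route into the
    # container network, so this times out or reports "no route to host".
    print("not reachable from the host:", exc)
finally:
    sock.close()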
Start scheduler
$ dask-scheduler
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Local Directory: /var/folders/0l/fmwbqvqn1tq96xf20rlz6xmm0000gp/T/scheduler-4hoeyo6f
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Scheduler at: tcp://192.168.0.27:8786
distributed.scheduler - INFO - dashboard at: :8787
Take note of my LAN IP (192.168.0.27 in this example).
Start worker
Start the worker in a Docker container, connecting to my LAN scheduler address.
$ docker run --rm -it --name worker daskdev/dask:2.9.1 dask-worker tcp://192.168.0.27:8786
Unable to find image 'daskdev/dask:2.9.1' locally
2.9.1: Pulling from daskdev/dask
b8f262c62ec6: Already exists
0a43c0154f16: Already exists
906d7b5da8fb: Already exists
03a506d38579: Pull complete
154ee2d747b3: Pull complete
d4c2e8bc6ff3: Pull complete
Digest: sha256:1e9e5c093497b65445d978737da96893005a7f24a493a9d4df382b8cf6351c15
Status: Downloaded newer image for daskdev/dask:2.9.1
+ '[' '' ']'
+ '[' -e /opt/app/environment.yml ']'
+ echo 'no environment.yml'
no environment.yml
+ '[' '' ']'
+ '[' '' ']'
+ exec dask-worker tcp://192.168.0.27:8786
distributed.nanny - INFO - Start Nanny at: 'tcp://172.17.0.2:35939'
distributed.dashboard.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: pip install jupyter-server-proxy
distributed.worker - INFO - Start worker at: tcp://172.17.0.2:43861
distributed.worker - INFO - Listening to: tcp://172.17.0.2:43861
distributed.worker - INFO - dashboard at: 172.17.0.2:43361
distributed.worker - INFO - Waiting to connect to: tcp://192.168.0.27:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 6
distributed.worker - INFO - Memory: 2.10 GB
distributed.worker - INFO - Local Directory: /worker-n5_mqtoc
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://192.168.0.27:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
Note that the worker has started listening on its container IP (tcp://172.17.0.2:43861), which is not routable from the scheduler.
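The worker ends up with that address because it checks which local interface it would use to reach the scheduler, which inside the container is the bridge-network interface. A rough, self-contained illustration of that kind of address discovery (not distributed's actual code, just the general technique):
import socket

def local_ip_towards(host, port=8786):
    """Which local IP would this machine use to reach ``host``?"""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # Connecting a UDP socket sends no packets; it just selects a route,
        # so getsockname() reveals the local address for that route.
        sock.connect((host, port))
        return sock.getsockname()[0]
    finally:
        sock.close()

# Inside the container this prints the bridge address (e.g. 172.17.0.2),
# which the scheduler on the host cannot route back to.
print(local_ip_towards("192.168.0.27"))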
Check the dashboard
If I look at the dashboard I can see my worker.
Send some work
In an IPython session on my laptop I run the following code. This code will hang for a while and then begin showing errors.
This works and returns 11 if I start the worker on my laptop rather than in the container.
from distributed import Client
client = Client("tcp://192.168.0.27:8786")
client.submit(lambda x: x + 1, 10).result()
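A quick way to see which address the scheduler will try to contact is to ask for its view of the cluster; in this setup the only key under "workers" should be the container address from the worker logs above.
# Continuing the session above: list the worker addresses the scheduler knows
# about. Here this should show only tcp://172.17.0.2:43861, which the
# scheduler cannot route to.
print(list(client.scheduler_info()["workers"]))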
Errors
The Client gives the following error.
distributed.client - WARNING - Couldn't gather 1 keys, rescheduling {'lambda-ad6d4ec86bc36fbb11841f10ac714071': ('tcp://172.17.0.2:43861',)}
The scheduler logs a corresponding error.
distributed.scheduler - ERROR - Couldn't gather keys {'lambda-ad6d4ec86bc36fbb11841f10ac714071': ['tcp://172.17.0.2:43861']} state: ['memory'] workers: ['tcp://172.17.0.2:43861']
NoneType: None
It also logs many iterations of the following register/remove cycle, as the scheduler drops the unreachable worker and the worker reconnects:
distributed.scheduler - INFO - Register worker <Worker 'tcp://172.17.0.2:43861', name: tcp://172.17.0.2:43861, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp://172.17.0.2:43861
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Remove worker <Worker 'tcp://172.17.0.2:43861', name: tcp://172.17.0.2:43861, memory: 0, processing: 0>
distributed.core - INFO - Removing comms to tcp://172.17.0.2:43861
The worker shows broken connection errors and then eventually closes.
distributed.worker - INFO - Connection to scheduler broken. Reconnecting...
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://192.168.0.27:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.worker - INFO - Connection to scheduler broken. Reconnecting...
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://192.168.0.27:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://192.168.0.27:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://192.168.0.27:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://192.168.0.27:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.worker - INFO - Connection to scheduler broken. Reconnecting...
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://192.168.0.27:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://192.168.0.27:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.worker - INFO - Connection to scheduler broken. Reconnecting...
distributed.worker - INFO - Connection to scheduler broken. Reconnecting...
distributed.worker - INFO - Stopping worker at tcp://172.17.0.2:43861
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://192.168.0.27:8786
distributed.worker - INFO - -------------------------------------------------
distributed.nanny - INFO - Worker closed
distributed.core - INFO - Starting established connection
distributed.nanny - INFO - Closing Nanny at 'tcp://172.17.0.2:35939'
distributed.dask_worker - INFO - End worker
Expected behaviour
If a worker connects to a scheduler, but it is not possible for the scheduler to create a connection back the other way, the worker should fail to connect.
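In the meantime, a practical workaround is to make the worker advertise an address the scheduler can actually reach, for example a port published from the container with docker run -p 9000:9000. Recent dask-worker releases expose --listen-address and --contact-address for this (check dask-worker --help for your version); the sketch below uses what I believe is the equivalent Worker keyword, so treat the parameter names and the port 9000 as assumptions to adapt to your setup.
# Sketch only: a worker that listens inside the container but tells the
# scheduler to dial back on a published, routable address. The contact_address
# keyword and exact parameters may differ between distributed versions.
import asyncio
from distributed import Worker

async def main():
    worker = await Worker(
        "tcp://192.168.0.27:8786",                   # scheduler address
        port=9000,                                   # fixed worker port, published from the container
        contact_address="tcp://192.168.0.27:9000",   # address the scheduler should connect back to
    )
    await worker.finished()

asyncio.run(main())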
Top GitHub Comments
Yeah, I can see how that could happen. For background, the handshake works roughly like this: the worker opens a connection to the scheduler and registers the address it is listening on, and the scheduler then opens its own connections back to that address when it needs to (for example, to gather results).
So because they both have to open a connection to the other, we can get into this weird situation where one side thinks that things are probably ok. It should, as you suggest, probably remove the worker after the initial connection fails.
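A minimal sketch of that kind of check, using plain asyncio rather than distributed's comm machinery: at registration time, try to open a connection back to the address the worker advertised and reject it if that fails. This is an illustration of the idea, not how the scheduler is implemented today; the container address below is the one from the logs above and should fail from the host.
import asyncio

async def can_dial_back(advertised_host, advertised_port, timeout=3):
    """Return True if we can open a TCP connection to the worker's advertised address."""
    try:
        reader, writer = await asyncio.wait_for(
            asyncio.open_connection(advertised_host, advertised_port), timeout
        )
    except (OSError, asyncio.TimeoutError):
        return False
    writer.close()
    return True

async def main():
    # e.g. the container address the worker advertised; from the host this
    # fails, so the registration would be rejected with a clear error.
    ok = await can_dial_back("172.17.0.2", 43861)
    print("accept worker" if ok else "reject worker: scheduler cannot route back")

asyncio.run(main())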
@mrocklin I found the worker logs to be pretty limited in information. This is what I see for each worker
Master log looks like this