OSError: Timed out trying to connect to 'tcp://172.20.129.23:43394' after 10 s: connect() didn't finish in time
Hi,
I am trying to use dask.distributed with the SLURM scheduler on a cluster. I tried some very simple examples, just trying to make it work, but the results are not very consistent. I searched online about this issue; updating tornado did not work for me. Here is the code I tried to run on the SLURM cluster.
The following code builds a SLURMCluster and runs joblib on it so that my parallelization can run across multiple machines. I am not using an interactive session; I submit the job as a batch script, which is attached after the following Python code (Python 3).
"""try_dask_joblib.py"""
import time
from joblib import Parallel, delayed, parallel_backend
import dask
from dask.distributed import Client
from dask_jobqueue import SLURMCluster
import socket
cluster = SLURMCluster(cores=8,
memory="10GB",
project="p30309",
walltime="00:59:59",
queue="short",
job_extra=[
"--output=../std_output/sim_oe_joblib_dask_2_3_2.txt",
"--open-mode=append"],
death_timeout=100)
print(dask.config.get("jobqueue.slurm.death-timeout"))
print(dir(cluster))
print("The worker spec is:\n {}".format(cluster.worker_spec,))
cluster.scale(10) # This line is necessary to start workers.
client = Client(cluster)
print("The worker spec is:\n {}".format(cluster.worker_spec,))
print("The scheduler info is:\n {}".format(client.scheduler_info(),))
print(dir(client))
print(client.status, client.start)
def long_running_function(i):
start_time = time.time()
while time.time()-start_time < 10:
temp_start_time = time.time()
while time.time() - temp_start_time < 1:
i**2
# print(time.time())
print(i, time.time(), time.time()-start_time)
print("The hostname is {}.\n".format(socket.gethostname(),))
return i, time.time(), time.time()-start_time, socket.gethostname()
with parallel_backend('dask', wait_for_workers_timeout=240, n_jobs=40):
res = Parallel(verbose=100, pre_dispatch='4*n_jobs')(
delayed(long_running_function)(i) for i in range(400))
print(res)
#!/bin/bash
#SBATCH -J cr_model
#SBATCH -A p30309
#SBATCH -p short
#SBATCH -t 0:59:59
#SBATCH -n 2 # Number of cores for the scheduler process.
#SBATCH --mem=10000 # Unit is MB.
#SBATCH --output=../std_output/sim_oe_joblib_dask_2_3_2.txt
#SBATCH --mail-type=BEGIN,END,FAIL,REQUEUE
# unload any modules that carried over from your command line session
module purge
# Set your working directory
cd /projects/p30309/CD/CD_github/ #$PBS_O_WORKDIR
# load modules you need to use
module load python/anaconda3.6
source activate py37
# Test joblib with dask
python ./command_script/try_dask_joblib.py
Problem 1: Every time I run this, I seem to get a different number of workers in the output: [Parallel(n_jobs=40)]: Using backend DaskDistributedBackend with 56 concurrent workers.
Shouldn't it be 80 (10 jobs x 8 cores each) if everything works correctly?
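One thing that may explain the varying count: joblib queries the scheduler as soon as the backend starts, before all of the SLURM worker jobs have come up. Below is a minimal sketch, assuming a reasonably recent dask.distributed that provides Client.wait_for_workers, of blocking until the requested workers have registered before handing work to joblib (resource options trimmed relative to the script above):

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Same resource options as in the script above, trimmed for brevity.
cluster = SLURMCluster(cores=8, memory="10GB", queue="short", walltime="00:59:59")
cluster.scale(10)                    # request 10 worker jobs from SLURM
client = Client(cluster)
client.wait_for_workers(10)          # block until all 10 workers have registered
print(len(client.scheduler_info()["workers"]))  # should now report 10 workers

With all 10 workers (8 threads each) registered before the parallel_backend block starts, the backend should report the expected 80 concurrent workers.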
Problem 2: I get the following error when I use Python 3.
distributed.nanny - INFO - Start Nanny at: 'tcp://172.20.131.42:43309'
distributed.nanny - INFO - Start Nanny at: 'tcp://172.20.129.110:35846'
distributed.nanny - INFO - Start Nanny at: 'tcp://172.20.131.68:43185'
distributed.nanny - INFO - Start Nanny at: 'tcp://172.20.131.51:43124'
distributed.nanny - INFO - Start Nanny at: 'tcp://172.20.131.68:33089'
distributed.nanny - INFO - Start Nanny at: 'tcp://172.20.130.128:39328'
distributed.diskutils - INFO - Found stale lock file and directory '/projects/p30309/CD/CD_github/worker-j3c7w8s_', purging
distributed.diskutils - INFO - Found stale lock file and directory '/projects/p30309/CD/CD_github/worker-j3c7w8s_', purging
distributed.diskutils - ERROR - Failed to remove '/projects/p30309/CD/CD_github/worker-j3c7w8s_' (failed in <built-in function lstat>): [Errno 2] No such file or directory: '/projects/p30309/CD/CD_github/worker-j3c7w8s_'
distributed.diskutils - INFO - Found stale lock file and directory '/projects/p30309/CD/CD_github/worker-doc4tad4', purging
distributed.diskutils - INFO - Found stale lock file and directory '/projects/p30309/CD/CD_github/worker-nml18s3o', purging
distributed.diskutils - INFO - Found stale lock file and directory '/projects/p30309/CD/CD_github/worker-flged81k', purging
distributed.diskutils - INFO - Found stale lock file and directory '/projects/p30309/CD/CD_github/worker-mlkny2rl', purging
distributed.diskutils - INFO - Found stale lock file and directory '/projects/p30309/CD/CD_github/worker-mh78vq49', purging
distributed.diskutils - INFO - Found stale lock file and directory '/projects/p30309/CD/CD_github/worker-3hxx95rw', purging
distributed.diskutils - INFO - Found stale lock file and directory '/projects/p30309/CD/CD_github/worker-w5xswvkl', purging
distributed.nanny - INFO - Start Nanny at: 'tcp://172.20.129.45:46697'
distributed.nanny - INFO - Start Nanny at: 'tcp://172.20.129.45:35281'
distributed.worker - INFO - Start worker at: tcp://172.20.129.110:41685
distributed.worker - INFO - Listening to: tcp://172.20.129.110:41685
distributed.worker - INFO - Waiting to connect to: tcp://172.20.129.23:43394
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 8
distributed.worker - INFO - Memory: 10.00 GB
distributed.worker - INFO - Local Directory: /projects/p30309/CD/CD_github/worker-q441qqyt
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://172.20.129.23:43394
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.worker - INFO - Start worker at: tcp://172.20.131.42:36892
distributed.worker - INFO - Listening to: tcp://172.20.131.42:36892
distributed.worker - INFO - Waiting to connect to: tcp://172.20.129.23:43394
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 8
distributed.worker - INFO - Memory: 10.00 GB
distributed.worker - INFO - Local Directory: /projects/p30309/CD/CD_github/worker-xauyk8g7
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Start worker at: tcp://172.20.131.68:35567
distributed.worker - INFO - Listening to: tcp://172.20.131.68:35567
distributed.worker - INFO - Waiting to connect to: tcp://172.20.129.23:43394
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 8
distributed.worker - INFO - Memory: 10.00 GB
distributed.worker - INFO - Local Directory: /projects/p30309/CD/CD_github/worker-u31kv4ju
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://172.20.129.23:43394
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.worker - INFO - Registered to: tcp://172.20.129.23:43394
distributed.worker - INFO - Start worker at: tcp://172.20.131.51:34705
distributed.worker - INFO - Listening to: tcp://172.20.131.51:34705
distributed.worker - INFO - Waiting to connect to: tcp://172.20.129.23:43394
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 8
distributed.worker - INFO - Memory: 10.00 GB
distributed.worker - INFO - Local Directory: /projects/p30309/CD/CD_github/worker-b_i71_pe
distributed.worker - INFO - Start worker at: tcp://172.20.131.68:34187
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - -------------------------------------------------
distributed.diskutils - INFO - Found stale lock file and directory '/projects/p30309/CD/CD_github/worker-q441qqyt', purging
distributed.worker - INFO - Listening to: tcp://172.20.131.68:34187
distributed.core - INFO - Starting established connection
distributed.worker - INFO - Registered to: tcp://172.20.129.23:43394
distributed.worker - INFO - Waiting to connect to: tcp://172.20.129.23:43394
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.worker - INFO - Threads: 8
distributed.worker - INFO - Memory: 10.00 GB
distributed.worker - INFO - Local Directory: /projects/p30309/CD/CD_github/worker-irf7n30m
distributed.worker - INFO - -------------------------------------------------
distributed.diskutils - INFO - Found stale lock file and directory '/projects/p30309/CD/CD_github/worker-b_i71_pe', purging
distributed.worker - INFO - Registered to: tcp://172.20.129.23:43394
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.diskutils - INFO - Found stale lock file and directory '/projects/p30309/CD/CD_github/worker-xauyk8g7', purging
distributed.diskutils - INFO - Found stale lock file and directory '/projects/p30309/CD/CD_github/worker-irf7n30m', purging
distributed.diskutils - INFO - Found stale lock file and directory '/projects/p30309/CD/CD_github/worker-u31kv4ju', purging
distributed.worker - INFO - Start worker at: tcp://172.20.130.128:41771
distributed.worker - INFO - Listening to: tcp://172.20.130.128:41771
distributed.worker - INFO - Waiting to connect to: tcp://172.20.129.23:43394
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 8
distributed.worker - INFO - Memory: 10.00 GB
distributed.worker - INFO - Local Directory: /projects/p30309/CD/CD_github/worker-ojdbnzgn
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://172.20.129.23:43394
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.nanny - INFO - Start Nanny at: 'tcp://172.20.130.6:42598'
distributed.diskutils - INFO - Found stale lock file and directory '/projects/p30309/CD/CD_github/worker-ojdbnzgn', purging
distributed.worker - INFO - Start worker at: tcp://172.20.129.45:44277
distributed.worker - INFO - Listening to: tcp://172.20.129.45:44277
distributed.worker - INFO - Waiting to connect to: tcp://172.20.129.23:43394
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 8
distributed.worker - INFO - Memory: 10.00 GB
distributed.worker - INFO - Local Directory: /projects/p30309/CD/CD_github/worker-n0agt42u
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://172.20.129.23:43394
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.worker - INFO - Start worker at: tcp://172.20.129.45:38788
distributed.worker - INFO - Listening to: tcp://172.20.129.45:38788
distributed.worker - INFO - Waiting to connect to: tcp://172.20.129.23:43394
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 8
distributed.worker - INFO - Memory: 10.00 GB
distributed.worker - INFO - Local Directory: /projects/p30309/CD/CD_github/worker-0xk48yqm
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://172.20.129.23:43394
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.diskutils - INFO - Found stale lock file and directory '/projects/p30309/CD/CD_github/worker-n0agt42u', purging
distributed.nanny - INFO - Start Nanny at: 'tcp://172.20.134.54:39716'
distributed.diskutils - INFO - Found stale lock file and directory '/projects/p30309/CD/CD_github/worker-0xk48yqm', purging
distributed.worker - INFO - Start worker at: tcp://172.20.130.6:36629
distributed.worker - INFO - Listening to: tcp://172.20.130.6:36629
distributed.worker - INFO - Waiting to connect to: tcp://172.20.129.23:43394
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 8
distributed.worker - INFO - Memory: 10.00 GB
distributed.worker - INFO - Local Directory: /projects/p30309/CD/CD_github/worker-yljqfs35
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://172.20.129.23:43394
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.diskutils - INFO - Found stale lock file and directory '/projects/p30309/CD/CD_github/worker-yljqfs35', purging
distributed.worker - INFO - Start worker at: tcp://172.20.134.54:33258
distributed.worker - INFO - Listening to: tcp://172.20.134.54:33258
distributed.worker - INFO - Waiting to connect to: tcp://172.20.129.23:43394
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 8
distributed.worker - INFO - Memory: 10.00 GB
distributed.worker - INFO - Local Directory: /projects/p30309/CD/CD_github/worker-tm7fx9_4
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://172.20.129.23:43394
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.129.45:51462 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.129.45:51464 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.129.45:51474 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.129.45:51476 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.131.68:51750 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.129.110:44572 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.131.51:35106 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.130.128:60842 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.129.45:51470 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.129.45:51472 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.131.68:51752 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.131.51:35108 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.130.128:60844 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.131.51:35110 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.134.54:51920 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.131.68:51754 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.131.42:55900 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.129.110:44574 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.131.68:51756 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.131.68:51758 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.129.110:44576 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.129.110:44578 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.131.68:51760 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.131.68:51762 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.131.68:51768 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.131.68:51770 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.130.128:60846 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.130.128:60848 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.129.110:44580 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.131.51:35112 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.131.42:55902 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.131.51:35114 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.131.42:55904 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.129.110:44582 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.129.45:51482 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.129.45:51484 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.131.68:51764 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.131.68:51766 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.131.68:51776 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.131.68:51778 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.134.54:51922 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.131.42:55906 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.131.42:55908 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.129.110:44584 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.131.68:51780 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.131.68:51782 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.131.42:55910 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.131.42:55912 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.131.51:35116 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.131.51:35118 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.129.110:44586 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.129.110:44588 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.131.51:35120 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.131.51:35122 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.129.45:51486 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.131.68:51784 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.131.68:51786 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.131.42:55914 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.129.45:51488 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.131.42:55916 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.129.45:51490 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.131.51:35124 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.129.45:51492 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.129.45:51494 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.131.51:35126 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.129.45:51496 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.131.51:35128 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.131.51:35130 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.129.110:44600 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.129.110:44602 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.129.110:44604 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.131.42:55918 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.131.42:55920 remote=tcp://172.20.129.23:43394>
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x2b55563fe990>>, <Task finished coro=<Worker.heartbeat() done, defined at /home/kzy816/.local/lib/python3.7/site-packages/distributed/worker.py:871> exception=OSError("Timed out trying to connect to 'tcp://172.20.129.23:43394' after 10 s: Timed out trying to connect to 'tcp://172.20.129.23:43394' after 10 s: connect() didn't finish in time")>)
Traceback (most recent call last):
File "/home/kzy816/.local/lib/python3.7/site-packages/distributed/comm/core.py", line 221, in connect
_raise(error)
File "/home/kzy816/.local/lib/python3.7/site-packages/distributed/comm/core.py", line 204, in _raise
raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://172.20.129.23:43394' after 10 s: connect() didn't finish in time
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/kzy816/.local/lib/python3.7/site-packages/tornado/ioloop.py", line 743, in _run_callback
ret = callback()
File "/home/kzy816/.local/lib/python3.7/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
future.result()
File "/home/kzy816/.local/lib/python3.7/site-packages/distributed/worker.py", line 880, in heartbeat
metrics=await self.get_metrics(),
File "/home/kzy816/.local/lib/python3.7/site-packages/distributed/core.py", line 735, in send_recv_from_rpc
comm = await self.pool.connect(self.addr)
File "/home/kzy816/.local/lib/python3.7/site-packages/distributed/core.py", line 869, in connect
connection_args=self.connection_args,
File "/home/kzy816/.local/lib/python3.7/site-packages/distributed/comm/core.py", line 230, in connect
_raise(error)
File "/home/kzy816/.local/lib/python3.7/site-packages/distributed/comm/core.py", line 204, in _raise
raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://172.20.129.23:43394' after 10 s: Timed out trying to connect to 'tcp://172.20.129.23:43394' after 10 s: connect() didn't finish in time
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.131.42:55922 remote=tcp://172.20.129.23:43394>
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x2b65423cb850>>, <Task finished coro=<Worker.heartbeat() done, defined at /home/kzy816/.local/lib/python3.7/site-packages/distributed/worker.py:871> exception=OSError("Timed out trying to connect to 'tcp://172.20.129.23:43394' after 10 s: Timed out trying to connect to 'tcp://172.20.129.23:43394' after 10 s: connect() didn't finish in time")>)
Traceback (most recent call last):
File "/home/kzy816/.local/lib/python3.7/site-packages/distributed/comm/core.py", line 221, in connect
_raise(error)
File "/home/kzy816/.local/lib/python3.7/site-packages/distributed/comm/core.py", line 204, in _raise
raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://172.20.129.23:43394' after 10 s: connect() didn't finish in time
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/kzy816/.local/lib/python3.7/site-packages/tornado/ioloop.py", line 743, in _run_callback
ret = callback()
File "/home/kzy816/.local/lib/python3.7/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
future.result()
File "/home/kzy816/.local/lib/python3.7/site-packages/distributed/worker.py", line 880, in heartbeat
metrics=await self.get_metrics(),
File "/home/kzy816/.local/lib/python3.7/site-packages/distributed/core.py", line 735, in send_recv_from_rpc
comm = await self.pool.connect(self.addr)
File "/home/kzy816/.local/lib/python3.7/site-packages/distributed/core.py", line 869, in connect
connection_args=self.connection_args,
File "/home/kzy816/.local/lib/python3.7/site-packages/distributed/comm/core.py", line 230, in connect
_raise(error)
File "/home/kzy816/.local/lib/python3.7/site-packages/distributed/comm/core.py", line 204, in _raise
raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://172.20.129.23:43394' after 10 s: Timed out trying to connect to 'tcp://172.20.129.23:43394' after 10 s: connect() didn't finish in time
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x2b1a63955b90>>, <Task finished coro=<Worker.heartbeat() done, defined at /home/kzy816/.local/lib/python3.7/site-packages/distributed/worker.py:871> exception=OSError("Timed out trying to connect to 'tcp://172.20.129.23:43394' after 10 s: Timed out trying to connect to 'tcp://172.20.129.23:43394' after 10 s: connect() didn't finish in time")>)
Traceback (most recent call last):
File "/home/kzy816/.local/lib/python3.7/site-packages/distributed/comm/core.py", line 221, in connect
_raise(error)
File "/home/kzy816/.local/lib/python3.7/site-packages/distributed/comm/core.py", line 204, in _raise
raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://172.20.129.23:43394' after 10 s: connect() didn't finish in time
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/kzy816/.local/lib/python3.7/site-packages/tornado/ioloop.py", line 743, in _run_callback
ret = callback()
File "/home/kzy816/.local/lib/python3.7/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
future.result()
File "/home/kzy816/.local/lib/python3.7/site-packages/distributed/worker.py", line 880, in heartbeat
metrics=await self.get_metrics(),
File "/home/kzy816/.local/lib/python3.7/site-packages/distributed/core.py", line 735, in send_recv_from_rpc
comm = await self.pool.connect(self.addr)
File "/home/kzy816/.local/lib/python3.7/site-packages/distributed/core.py", line 869, in connect
connection_args=self.connection_args,
File "/home/kzy816/.local/lib/python3.7/site-packages/distributed/comm/core.py", line 230, in connect
_raise(error)
File "/home/kzy816/.local/lib/python3.7/site-packages/distributed/comm/core.py", line 204, in _raise
raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://172.20.129.23:43394' after 10 s: Timed out trying to connect to 'tcp://172.20.129.23:43394' after 10 s: connect() didn't finish in time
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.131.51:35132 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.129.110:44606 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.129.110:44608 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.129.110:44610 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.131.51:35134 remote=tcp://172.20.129.23:43394>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://172.20.131.51:35136 remote=tcp://172.20.129.23:43394>
I didn't find anything useful by searching the internet.
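Since the default connect timeout is only 10 s, one thing worth trying is simply raising it. Here is a minimal sketch, assuming the distributed.comm.timeouts.connect config key and dask_jobqueue's env_extra option (option names may differ across versions); the timeout errors above come from the worker processes, so the setting has to reach them as well, not just the client script:

import dask
from dask_jobqueue import SLURMCluster

# Raise the connect timeout (default 10 s) in the client process.
dask.config.set({"distributed.comm.timeouts.connect": "60s"})

# Propagate the same setting to each worker by exporting the matching
# environment variable in the generated SLURM job script.
cluster = SLURMCluster(
    cores=8,
    memory="10GB",
    env_extra=["export DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT=60s"],
)

A shared config file such as ~/.config/dask/distributed.yaml would also reach all nodes if the home directory is on a shared filesystem.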
Problem 3:
On Python 2, I also have the problem that I cannot get as many workers as I specify via SLURMCluster() and the .scale() method. Besides, I also get errors like the following. I tried changing the --local_directory option that Matthew once mentioned on Stack Overflow, but it did not solve the problem.
distributed.diskutils - ERROR - Failed to remove u'/projects/p30309/CD/CD_github/worker-Ra0Bcx' (failed in <built-in function rmdir>): [Errno 2] No such file or directory: '/projects/p30309/CD/CD_github/worker-Ra0Bcx'
distributed.diskutils - ERROR - Failed to remove u'/projects/p30309/CD/CD_github/worker-ZVE8Mw' (failed in <built-in function rmdir>): [Errno 2] No such file or directory: '/projects/p30309/CD/CD_github/worker-ZVE8Mw'
distributed.diskutils - ERROR - Failed to remove u'/projects/p30309/CD/CD_github/worker-ZgTe6A' (failed in <built-in function rmdir>): [Errno 2] No such file or directory: '/projects/p30309/CD/CD_github/worker-ZgTe6A'
distributed.diskutils - ERROR - Failed to remove u'/projects/p30309/CD/CD_github/worker-HRoPNu' (failed in <built-in function listdir>): [Errno 2] No such file or directory: '/projects/p30309/CD/CD_github/worker-HRoPNu'
distributed.diskutils - ERROR - Failed to remove u'/projects/p30309/CD/CD_github/worker-IF84m_' (failed in <built-in function listdir>): [Errno 2] No such file or directory: '/projects/p30309/CD/CD_github/worker-IF84m_'
distributed.worker - INFO - Connection to scheduler broken. Reconnecting...
distributed.worker - INFO - Connection to scheduler broken. Reconnecting...
distributed.worker - INFO - Connection to scheduler broken. Reconnecting...
distributed.diskutils - ERROR - Failed to remove u'/projects/p30309/CD/CD_github/worker-IF84m_' (failed in <built-in function rmdir>): [Errno 2] No such file or directory: '/projects/p30309/CD/CD_github/worker-IF84m_'
distributed.diskutils - ERROR - Failed to remove u'/projects/p30309/CD/CD_github/worker-HRoPNu' (failed in <built-in function rmdir>): [Errno 2] No such file or directory: '/projects/p30309/CD/CD_github/worker-HRoPNu'
Compared with the problem of not getting enough workers, this may be a minor issue, since I can still get results.
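The stale-lock and "Failed to remove" messages suggest that all workers are using the same shared /projects directory for their scratch space. One commonly suggested workaround is to point the workers' local directory at node-local storage instead; a minimal sketch, where "$TMPDIR" is an assumption and should be replaced by whatever node-local scratch path the cluster provides:

from dask_jobqueue import SLURMCluster

# Point worker scratch space at node-local storage instead of the shared
# /projects filesystem, so workers on different nodes stop competing over
# the same worker-* directories and lock files.
cluster = SLURMCluster(
    cores=8,
    memory="10GB",
    local_directory="$TMPDIR",  # expanded by the shell when the worker job script runs
)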
Summary: I guess dask.distributed is mainly designed to support distributed computation in interactive sessions, and I am not sure whether this is the correct way to use it with batch submission. I want to be able to use joblib through the dask client so that my joblib parallelization is not constrained to a single node of the cluster. I would really appreciate it if anyone could share best practices and some insight into why these problems occur. Thank you!
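For reference, here is a minimal end-to-end sketch of the batch-submission pattern described above. Assumptions: the same SLURMCluster resource options as in the question (trimmed), a hypothetical work() function standing in for the real computation, and a reasonably recent dask.distributed providing Client.wait_for_workers. The explicit close() calls at the end let the batch job and its worker jobs exit cleanly instead of lingering in the queue.

from joblib import Parallel, delayed, parallel_backend
from dask.distributed import Client
from dask_jobqueue import SLURMCluster


def work(i):
    return i ** 2  # placeholder for the real long-running function


cluster = SLURMCluster(cores=8, memory="10GB", queue="short", walltime="00:59:59")
cluster.scale(10)
client = Client(cluster)
client.wait_for_workers(10)          # don't start until all workers are up

with parallel_backend("dask"):
    results = Parallel(verbose=10)(delayed(work)(i) for i in range(400))

# Shut everything down explicitly so the batch job exits cleanly.
client.close()
cluster.close()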
Top GitHub Comments
Hi, I have encountered the same error: OSError: Timed out trying to connect to 'tcp://127.0.0.1:54234' after 10 s: connect() didn't finish in time. In my opinion, it is related to the size of a dask dataframe saved as Parquet. The variable n=1000 sets the size of the dask dataframe; when n is smaller, the error does not show.
Here is reproducible code on my machine (MacBook Pro, 16 GB, macOS Catalina):
@jakirkham I can do it, but probably not anytime soon, because I am too busy with my research right now. Sorry.