TimeoutError: Timeout -- /tornado/gen.py in run(self)
See original GitHub issue.
I am starting to run some computations on Cori @ NERSC, using their Jupyter notebooks. Below, I am sending a screenshot of how I build the Client using SLURM. As you can see, I am using dask_jobqueue to create a SLURM script that runs on Cori; the Client is imported from distributed.
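Since the screenshot is not reproduced here, the snippet below is a minimal sketch of how such a cluster is typically built with dask_jobqueue; the queue name and resource values are assumptions, not the actual settings from the notebook.
from dask_jobqueue import SLURMCluster
from distributed import Client

# Hypothetical resource settings -- the real values were in the screenshot.
cluster = SLURMCluster(
    queue="regular",        # assumed queue name
    cores=32,               # cores per SLURM job
    memory="100GB",         # memory per SLURM job
    walltime="01:00:00",
)
cluster.scale(2)            # ask SLURM for enough jobs to run two workers

client = Client(cluster)    # this is the call that raises the TimeoutError below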
This is related to https://github.com/dask/distributed/issues/2581 where I am deploying pytorch models using dask. The error I get is the following:
---------------------------------------------------------------------------
TimeoutError Traceback (most recent call last)
<ipython-input-8-f971315bc1bd> in <module>
13 'amsgrad': True})
14
---> 15 client = Client(cluster)
16 calc = model()
17 calc.train(training_set=images, epochs=epochs,
~/.local/cori/3.6-anaconda-4.4/lib/python3.6/site-packages/distributed/client.py in __init__(self, address, loop, timeout, set_as_default, scheduler_file, security, asynchronous, name, heartbeat_interval, serializers, deserializers, extensions, direct_to_workers, **kwargs)
710 ext(self)
711
--> 712 self.start(timeout=timeout)
713
714 from distributed.recreate_exceptions import ReplayExceptionClient
~/.local/cori/3.6-anaconda-4.4/lib/python3.6/site-packages/distributed/client.py in start(self, **kwargs)
856 self._started = self._start(**kwargs)
857 else:
--> 858 sync(self.loop, self._start, **kwargs)
859
860 def __await__(self):
~/.local/cori/3.6-anaconda-4.4/lib/python3.6/site-packages/distributed/utils.py in sync(loop, func, *args, **kwargs)
329 e.wait(10)
330 if error[0]:
--> 331 six.reraise(*error[0])
332 else:
333 return result[0]
~/.local/cori/3.6-anaconda-4.4/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
691 if value.__traceback__ is not tb:
692 raise value.with_traceback(tb)
--> 693 raise value
694 finally:
695 value = None
~/.local/cori/3.6-anaconda-4.4/lib/python3.6/site-packages/distributed/utils.py in f()
314 if timeout is not None:
315 future = gen.with_timeout(timedelta(seconds=timeout), future)
--> 316 result[0] = yield future
317 except Exception as exc:
318 error[0] = sys.exc_info()
/global/common/cori/software/python/3.6-anaconda-5.2/lib/python3.6/site-packages/tornado/gen.py in run(self)
1097
1098 try:
-> 1099 value = future.result()
1100 except Exception:
1101 self.had_exception = True
/global/common/cori/software/python/3.6-anaconda-5.2/lib/python3.6/site-packages/tornado/gen.py in run(self)
1105 if exc_info is not None:
1106 try:
-> 1107 yielded = self.gen.throw(*exc_info)
1108 finally:
1109 # Break up a reference to itself
~/.local/cori/3.6-anaconda-4.4/lib/python3.6/site-packages/distributed/client.py in _start(self, timeout, **kwargs)
952 self.scheduler_comm = None
953
--> 954 yield self._ensure_connected(timeout=timeout)
955
956 for pc in self._periodic_callbacks.values():
/global/common/cori/software/python/3.6-anaconda-5.2/lib/python3.6/site-packages/tornado/gen.py in run(self)
1097
1098 try:
-> 1099 value = future.result()
1100 except Exception:
1101 self.had_exception = True
/global/common/cori/software/python/3.6-anaconda-5.2/lib/python3.6/site-packages/tornado/gen.py in run(self)
1105 if exc_info is not None:
1106 try:
-> 1107 yielded = self.gen.throw(*exc_info)
1108 finally:
1109 # Break up a reference to itself
~/.local/cori/3.6-anaconda-4.4/lib/python3.6/site-packages/distributed/client.py in _ensure_connected(self, timeout)
1013 if timeout is not None:
1014 yield gen.with_timeout(
-> 1015 timedelta(seconds=timeout), self._update_scheduler_info()
1016 )
1017 else:
/global/common/cori/software/python/3.6-anaconda-5.2/lib/python3.6/site-packages/tornado/gen.py in run(self)
1097
1098 try:
-> 1099 value = future.result()
1100 except Exception:
1101 self.had_exception = True
TimeoutError: Timeout
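The Timeout above is raised by the client's own connection timeout while it waits for the scheduler to respond (the gen.with_timeout call wrapping _update_scheduler_info). If the scheduler is simply slow to come up, that window can be widened; a minimal sketch, where the 60-second values are arbitrary examples rather than recommendations:
import dask
from distributed import Client

# Either raise the connection timeout globally through the Dask config...
dask.config.set({"distributed.comm.timeouts.connect": "60s"})

# ...or pass a longer timeout directly when creating the Client.
# `cluster` is the SLURMCluster built earlier.
client = Client(cluster, timeout=60)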
Just for the record, the other part of the script is attached, too.
What could be the cause of this? I was running the same script yesterday, and it worked up to a point until I got this other error:
elkhati@cori06:~/ml4chem> cat slurm-21953715.out
distributed.nanny - INFO - Start Nanny at: 'tcp://10.128.6.8:41834'
distributed.diskutils - ERROR - Failed to clean up lingering worker directories in path: %s
Traceback (most recent call last):
File "/global/homes/m/melkhati/.local/cori/3.6-anaconda-4.4/lib/python3.6/site-packages/distributed/diskutils.py", line 239, in new_work_dir
self._purge_leftovers()
File "/global/homes/m/melkhati/.local/cori/3.6-anaconda-4.4/lib/python3.6/site-packages/distributed/diskutils.py", line 146, in _purge_leftovers
lock.acquire()
File "/global/homes/m/melkhati/.local/cori/3.6-anaconda-4.4/lib/python3.6/site-packages/distributed/locket.py", line 196, in acquire
self._lock.acquire(self._timeout, self._retry_period)
File "/global/homes/m/melkhati/.local/cori/3.6-anaconda-4.4/lib/python3.6/site-packages/distributed/locket.py", line 125, in acquire
lock.acquire(timeout, retry_period)
File "/global/homes/m/melkhati/.local/cori/3.6-anaconda-4.4/lib/python3.6/site-packages/distributed/locket.py", line 175, in acquire
path=self._path,
File "/global/homes/m/melkhati/.local/cori/3.6-anaconda-4.4/lib/python3.6/site-packages/distributed/locket.py", line 108, in _acquire_non_blocking
success = acquire()
File "/global/homes/m/melkhati/.local/cori/3.6-anaconda-4.4/lib/python3.6/site-packages/distributed/locket.py", line 172, in <lambda>
acquire=lambda: _lock_file_non_blocking(self._file),
File "/global/homes/m/melkhati/.local/cori/3.6-anaconda-4.4/lib/python3.6/site-packages/distributed/locket.py", line 64, in _lock_file_non_blocking
fcntl.flock(file_.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
OSError: [Errno 524] Unknown error 524
distributed.diskutils - ERROR - Could not acquire workspace lock on path: /global/u2/m/melkhati/ml4chem/worker-1k80c1c2.dirlock .Continuing without lock. This may result in workspaces not being cleaned up
Traceback (most recent call last):
File "/global/homes/m/melkhati/.local/cori/3.6-anaconda-4.4/lib/python3.6/site-packages/distributed/diskutils.py", line 57, in __init__
with workspace._global_lock():
File "/global/homes/m/melkhati/.local/cori/3.6-anaconda-4.4/lib/python3.6/site-packages/distributed/locket.py", line 202, in __enter__
self.acquire()
File "/global/homes/m/melkhati/.local/cori/3.6-anaconda-4.4/lib/python3.6/site-packages/distributed/locket.py", line 196, in acquire
self._lock.acquire(self._timeout, self._retry_period)
File "/global/homes/m/melkhati/.local/cori/3.6-anaconda-4.4/lib/python3.6/site-packages/distributed/locket.py", line 125, in acquire
lock.acquire(timeout, retry_period)
File "/global/homes/m/melkhati/.local/cori/3.6-anaconda-4.4/lib/python3.6/site-packages/distributed/locket.py", line 169, in acquire
_lock_file_blocking(self._file)
File "/global/homes/m/melkhati/.local/cori/3.6-anaconda-4.4/lib/python3.6/site-packages/distributed/locket.py", line 60, in _lock_file_blocking
fcntl.flock(file_.fileno(), fcntl.LOCK_EX)
OSError: [Errno 524] Unknown error 524
distributed.worker - INFO - Start worker at: tcp://10.128.6.8:33767
distributed.worker - INFO - Listening to: tcp://10.128.6.8:33767
distributed.worker - INFO - nanny at: 10.128.6.8:41834
distributed.worker - INFO - bokeh at: 10.128.6.8:45394
distributed.worker - INFO - Waiting to connect to: tcp://128.55.224.44:38815
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 1
distributed.worker - INFO - Memory: 10.00 GB
distributed.worker - INFO - Local Directory: /global/u2/m/melkhati/ml4chem/worker-1k80c1c2
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://128.55.224.44:38815
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.core - INFO - Event loop was unresponsive in Worker for 1.97s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
slurmstepd: error: *** JOB 21953715 ON nid01531 CANCELLED AT 2019-06-01T18:41:08 DUE TO TIME LIMIT ***
distributed.dask_worker - INFO - Exiting on signal 15
distributed.nanny - INFO - Closing Nanny at 'tcp://10.128.6.8:41834'
distributed.dask_worker - INFO - Exiting on signal 15
distributed.dask_worker - INFO - End worker
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-1, started daemon)>
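For context, the OSError: [Errno 524] entries in the log above come from fcntl.flock, which the shared file system holding the worker directory apparently does not support. A minimal sketch of steering the workers' scratch space to node-local storage instead (the /tmp path and resource values are assumptions, not NERSC recommendations):
from dask_jobqueue import SLURMCluster

# Keep worker scratch directories (and their lock files) off the shared
# file system; the path below is only an assumed example.
cluster = SLURMCluster(
    queue="regular",
    cores=32,
    memory="100GB",
    local_directory="/tmp",
)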
I would appreciate any input or advice you may have about this so that I can try to fix it.
Thanks.
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Are there any remaining issues here @muammar?
cc-ing @jcrist (in case the borderline deployment question above is of interest 😉)