
TimeoutError: Timeout -- /tornado/gen.py in run(self)


I am starting to run some computations on Cori at NERSC, using their Jupyter notebooks. Below is a screenshot showing how I build the Client using SLURM.

[Screenshot: notebook cell building the SLURM cluster and the Client]

As the screenshot shows, I am using dask_jobqueue to create a SLURM script that runs on Cori; the Client itself is imported from distributed.
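
In short, the setup looks roughly like this (the queue name, cores, memory, walltime, and scale value below are placeholders, not my exact settings):

from dask_jobqueue import SLURMCluster
from distributed import Client

# Placeholder resources -- the actual queue, cores, memory, and walltime
# used on Cori are not shown here.
cluster = SLURMCluster(cores=24,
                       memory='10GB',
                       queue='regular',
                       walltime='00:30:00')
cluster.scale(4)          # ask SLURM for 4 worker jobs

client = Client(cluster)  # this is the call that raises the TimeoutError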

This is related to https://github.com/dask/distributed/issues/2581, where I am deploying PyTorch models with Dask. The error I get is the following:

---------------------------------------------------------------------------
TimeoutError                              Traceback (most recent call last)
<ipython-input-8-f971315bc1bd> in <module>
     13                       'amsgrad': True})
     14 
---> 15 client = Client(cluster)
     16 calc = model()
     17 calc.train(training_set=images, epochs=epochs,

~/.local/cori/3.6-anaconda-4.4/lib/python3.6/site-packages/distributed/client.py in __init__(self, address, loop, timeout, set_as_default, scheduler_file, security, asynchronous, name, heartbeat_interval, serializers, deserializers, extensions, direct_to_workers, **kwargs)
    710             ext(self)
    711 
--> 712         self.start(timeout=timeout)
    713 
    714         from distributed.recreate_exceptions import ReplayExceptionClient

~/.local/cori/3.6-anaconda-4.4/lib/python3.6/site-packages/distributed/client.py in start(self, **kwargs)
    856             self._started = self._start(**kwargs)
    857         else:
--> 858             sync(self.loop, self._start, **kwargs)
    859 
    860     def __await__(self):

~/.local/cori/3.6-anaconda-4.4/lib/python3.6/site-packages/distributed/utils.py in sync(loop, func, *args, **kwargs)
    329             e.wait(10)
    330     if error[0]:
--> 331         six.reraise(*error[0])
    332     else:
    333         return result[0]

~/.local/cori/3.6-anaconda-4.4/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
    691             if value.__traceback__ is not tb:
    692                 raise value.with_traceback(tb)
--> 693             raise value
    694         finally:
    695             value = None

~/.local/cori/3.6-anaconda-4.4/lib/python3.6/site-packages/distributed/utils.py in f()
    314             if timeout is not None:
    315                 future = gen.with_timeout(timedelta(seconds=timeout), future)
--> 316             result[0] = yield future
    317         except Exception as exc:
    318             error[0] = sys.exc_info()

/global/common/cori/software/python/3.6-anaconda-5.2/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1097 
   1098                     try:
-> 1099                         value = future.result()
   1100                     except Exception:
   1101                         self.had_exception = True

/global/common/cori/software/python/3.6-anaconda-5.2/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1105                     if exc_info is not None:
   1106                         try:
-> 1107                             yielded = self.gen.throw(*exc_info)
   1108                         finally:
   1109                             # Break up a reference to itself

~/.local/cori/3.6-anaconda-4.4/lib/python3.6/site-packages/distributed/client.py in _start(self, timeout, **kwargs)
    952         self.scheduler_comm = None
    953 
--> 954         yield self._ensure_connected(timeout=timeout)
    955 
    956         for pc in self._periodic_callbacks.values():

/global/common/cori/software/python/3.6-anaconda-5.2/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1097 
   1098                     try:
-> 1099                         value = future.result()
   1100                     except Exception:
   1101                         self.had_exception = True

/global/common/cori/software/python/3.6-anaconda-5.2/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1105                     if exc_info is not None:
   1106                         try:
-> 1107                             yielded = self.gen.throw(*exc_info)
   1108                         finally:
   1109                             # Break up a reference to itself

~/.local/cori/3.6-anaconda-4.4/lib/python3.6/site-packages/distributed/client.py in _ensure_connected(self, timeout)
   1013             if timeout is not None:
   1014                 yield gen.with_timeout(
-> 1015                     timedelta(seconds=timeout), self._update_scheduler_info()
   1016                 )
   1017             else:

/global/common/cori/software/python/3.6-anaconda-5.2/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1097 
   1098                     try:
-> 1099                         value = future.result()
   1100                     except Exception:
   1101                         self.had_exception = True

TimeoutError: Timeout

Just for the record, the other part of the script is attached, too:

[Screenshot: the remaining part of the script]
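
That part looks roughly like this (model, images, and epochs are the names visible in the traceback above; everything else is a placeholder):

# Rough reconstruction of the training cell; only model, images, and
# epochs come from the traceback above, the rest is a placeholder.
calc = model()
calc.train(training_set=images, epochs=epochs)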

What would be the cause of this? I was running the same script yesterday, and it worked out up to some point until I got this other error:

elkhati@cori06:~/ml4chem> cat slurm-21953715.out
distributed.nanny - INFO -         Start Nanny at: 'tcp://10.128.6.8:41834'
distributed.diskutils - ERROR - Failed to clean up lingering worker directories in path: %s
Traceback (most recent call last):
  File "/global/homes/m/melkhati/.local/cori/3.6-anaconda-4.4/lib/python3.6/site-packages/distributed/diskutils.py", line 239, in new_work_dir
    self._purge_leftovers()
  File "/global/homes/m/melkhati/.local/cori/3.6-anaconda-4.4/lib/python3.6/site-packages/distributed/diskutils.py", line 146, in _purge_leftovers
    lock.acquire()
  File "/global/homes/m/melkhati/.local/cori/3.6-anaconda-4.4/lib/python3.6/site-packages/distributed/locket.py", line 196, in acquire
    self._lock.acquire(self._timeout, self._retry_period)
  File "/global/homes/m/melkhati/.local/cori/3.6-anaconda-4.4/lib/python3.6/site-packages/distributed/locket.py", line 125, in acquire
    lock.acquire(timeout, retry_period)
  File "/global/homes/m/melkhati/.local/cori/3.6-anaconda-4.4/lib/python3.6/site-packages/distributed/locket.py", line 175, in acquire
    path=self._path,
  File "/global/homes/m/melkhati/.local/cori/3.6-anaconda-4.4/lib/python3.6/site-packages/distributed/locket.py", line 108, in _acquire_non_blocking
    success = acquire()
  File "/global/homes/m/melkhati/.local/cori/3.6-anaconda-4.4/lib/python3.6/site-packages/distributed/locket.py", line 172, in <lambda>
    acquire=lambda: _lock_file_non_blocking(self._file),
  File "/global/homes/m/melkhati/.local/cori/3.6-anaconda-4.4/lib/python3.6/site-packages/distributed/locket.py", line 64, in _lock_file_non_blocking
    fcntl.flock(file_.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
OSError: [Errno 524] Unknown error 524
distributed.diskutils - ERROR - Could not acquire workspace lock on path: /global/u2/m/melkhati/ml4chem/worker-1k80c1c2.dirlock .Continuing without lock. This may result in workspaces not being cleaned up
Traceback (most recent call last):
  File "/global/homes/m/melkhati/.local/cori/3.6-anaconda-4.4/lib/python3.6/site-packages/distributed/diskutils.py", line 57, in __init__
    with workspace._global_lock():
  File "/global/homes/m/melkhati/.local/cori/3.6-anaconda-4.4/lib/python3.6/site-packages/distributed/locket.py", line 202, in __enter__
    self.acquire()
  File "/global/homes/m/melkhati/.local/cori/3.6-anaconda-4.4/lib/python3.6/site-packages/distributed/locket.py", line 196, in acquire
    self._lock.acquire(self._timeout, self._retry_period)
  File "/global/homes/m/melkhati/.local/cori/3.6-anaconda-4.4/lib/python3.6/site-packages/distributed/locket.py", line 125, in acquire
    lock.acquire(timeout, retry_period)
  File "/global/homes/m/melkhati/.local/cori/3.6-anaconda-4.4/lib/python3.6/site-packages/distributed/locket.py", line 169, in acquire
    _lock_file_blocking(self._file)
  File "/global/homes/m/melkhati/.local/cori/3.6-anaconda-4.4/lib/python3.6/site-packages/distributed/locket.py", line 60, in _lock_file_blocking
    fcntl.flock(file_.fileno(), fcntl.LOCK_EX)
OSError: [Errno 524] Unknown error 524
distributed.worker - INFO -       Start worker at:     tcp://10.128.6.8:33767
distributed.worker - INFO -          Listening to:     tcp://10.128.6.8:33767
distributed.worker - INFO -              nanny at:           10.128.6.8:41834
distributed.worker - INFO -              bokeh at:           10.128.6.8:45394
distributed.worker - INFO - Waiting to connect to:  tcp://128.55.224.44:38815
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                   10.00 GB
distributed.worker - INFO -       Local Directory: /global/u2/m/melkhati/ml4chem/worker-1k80c1c2
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:  tcp://128.55.224.44:38815
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.core - INFO - Event loop was unresponsive in Worker for 1.97s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
slurmstepd: error: *** JOB 21953715 ON nid01531 CANCELLED AT 2019-06-01T18:41:08 DUE TO TIME LIMIT ***
distributed.dask_worker - INFO - Exiting on signal 15
distributed.nanny - INFO - Closing Nanny at 'tcp://10.128.6.8:41834'
distributed.dask_worker - INFO - Exiting on signal 15
distributed.dask_worker - INFO - End worker
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-1, started daemon)>

I would appreciate any input or advice on this so that I can try to fix it.

Thanks.

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 9 (9 by maintainers)

Top GitHub Comments

1 reaction
TomAugspurger commented, Jun 4, 2019

Are there any remaining issues here @muammar?

1 reaction
jakirkham commented, Jun 3, 2019

cc-ing @jcrist (in case the borderline deployment question above is of interest 😉)


