Nanny Fails to Connect in 60 seconds
I am attempting to run some very simple Dask code that uses SLURMCluster to submit jobs to a SLURM job scheduler. However, every job I submit fails with the following error: asyncio.exceptions.TimeoutError: Nanny failed to start in 60 seconds.
My cluster setup is as follows:
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    name='worker_bee',
    queue='normal',
    project='----------',  # hidden for privacy
    cores=8,
    memory='2GB',
    interface='ib0',
)
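(For context, the jobs in question are submitted by scaling this cluster and attaching a client; a minimal sketch of that pattern, assuming standard dask.distributed usage rather than the exact script used here:)

from dask.distributed import Client

cluster.scale(jobs=1)      # submits one SLURM job that runs a dask worker (with its nanny)
client = Client(cluster)   # connects to the scheduler started by SLURMCluster
print(client.scheduler_info())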
I confirmed the cluster I am on has ib0 and the nanny is attempting to connect to the ib0 IP address. I have also confirmed that the job headers written by the cluster work, as a simple Python script with the same job headers runs fine when submitted to sbatch.
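(One way to double-check both of those points from the Python side is to inspect what dask-jobqueue actually generates; a minimal sketch, assuming the cluster object defined above:)

# Print the SLURM batch script that SLURMCluster will submit,
# including the #SBATCH headers and the dask-worker command line.
print(cluster.job_script())

# Address the workers will be told to connect back to; with
# interface='ib0' this should be the scheduler's ib0 IP.
print(cluster.scheduler_address)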
The output file from the job is as follows:
distributed.nanny - INFO - Start Nanny at: 'tcp://192.168.206.35:41398'
distributed.dashboard.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
distributed.worker - INFO - Start worker at: tcp://192.168.206.35:38225
distributed.worker - INFO - Listening to: tcp://192.168.206.35:38225
distributed.worker - INFO - dashboard at: 192.168.206.35:39122
distributed.worker - INFO - Waiting to connect to: tcp://192.168.192.54:38345
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 8
distributed.worker - INFO - Memory: 2.00 GB
distributed.worker - INFO - Local Directory: /home1/06729/tg860286/tests/dask-rsmas-presentation/dask-worker-space/worker-37h141zz
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Waiting to connect to: tcp://192.168.192.54:38345
distributed.worker - INFO - Waiting to connect to: tcp://192.168.192.54:38345
distributed.worker - INFO - Waiting to connect to: tcp://192.168.192.54:38345
distributed.worker - INFO - Waiting to connect to: tcp://192.168.192.54:38345
distributed.nanny - INFO - Closing Nanny at 'tcp://192.168.206.35:41398'
distributed.worker - INFO - Stopping worker at tcp://192.168.206.35:38225
distributed.worker - INFO - Closed worker has not yet started: None
distributed.dask_worker - INFO - End worker
Traceback (most recent call last):
File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/site-packages/distributed/node.py", line 173, in wait_for
await asyncio.wait_for(future, timeout=timeout)
File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/asyncio/tasks.py", line 490, in wait_for
raise exceptions.TimeoutError()
asyncio.exceptions.TimeoutError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/runpy.py", line 193, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/site-packages/distributed/cli/dask_worker.py", line 440, in <module>
go()
File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/site-packages/distributed/cli/dask_worker.py", line 436, in go
main()
File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/site-packages/click/core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/site-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/site-packages/click/core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/site-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/site-packages/distributed/cli/dask_worker.py", line 422, in main
loop.run_sync(run)
File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/site-packages/tornado/ioloop.py", line 532, in run_sync
return future_cell[0].result()
File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/site-packages/distributed/cli/dask_worker.py", line 416, in run
await asyncio.gather(*nannies)
File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/asyncio/tasks.py", line 684, in _wrap_awaitable
return (yield from awaitable.__await__())
File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/site-packages/distributed/node.py", line 176, in wait_for
raise TimeoutError(
asyncio.exceptions.TimeoutError: Nanny failed to start in 60 seconds
I can find no additional documentation of the potential causes of this error, so any help would be appreciated.
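(The repeated "Waiting to connect to: tcp://192.168.192.54:38345" lines suggest the compute node never reaches the scheduler on its ib0 address. A quick reachability test run from inside a SLURM job on a compute node can confirm or rule that out; a minimal sketch, with the scheduler address taken from the log above:)

import socket

# Scheduler address the workers are trying to reach (from the log above).
host, port = "192.168.192.54", 38345

try:
    with socket.create_connection((host, port), timeout=10):
        print(f"TCP connection to {host}:{port} succeeded")
except OSError as exc:
    print(f"Could not reach {host}:{port}: {exc}")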
Relevant Packages:
dask 2.11.0 py_0 conda-forge
dask-core 2.11.0 py_0 conda-forge
dask-jobqueue 0.7.0 py_0 conda-forge
Top GitHub Comments
So for completeness: after further investigation, @lukeconibear's issue was that the interactive job was killed because it was using too many resources. In his SGE setup,
qacct -j <his_job_id>
was useful for troubleshooting. I am guessing that a similar tool exists for non-SGE clusters.

Thanks for your feedback. I suggest we move to https://github.com/dask/distributed/issues/3857 to discuss this issue, as there may be more people with useful feedback there.
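(For a SLURM cluster like the one in this issue, the analogous accounting check would be something along these lines; the sacct fields shown are standard ones, but the columns worth inspecting may vary by site:)
sacct -j <job_id> --format=JobID,JobName,State,ExitCode,Elapsed,MaxRSS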