
Nanny Fails to Connect in 60 seconds

See original GitHub issue

I am attempting to run some very simple dask code that uses SLURMCluster to submit jobs to a SLURM job scheduler. However, every job I submit throws the following error: asyncio.exceptions.TimeoutError: Nanny failed to start in 60 seconds.

My cluster setup is as follows:

cluster = SLURMCluster(
    name='worker_bee',
    queue='normal',
    project='----------',  # hidden for privacy
    cores=8,
    memory='2GB',
    interface='ib0',
)
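
The issue does not show how the jobs were actually submitted. A typical pattern with this configuration would look roughly like the sketch below; the scale(jobs=2) call and the client connection are illustrative assumptions, not taken from the original report.

from dask.distributed import Client

# Using the cluster object defined above: ask SLURM for two worker jobs and
# connect a client to the scheduler. Each submitted worker job is where the
# "Nanny failed to start in 60 seconds" error appears.
cluster.scale(jobs=2)
client = Client(cluster)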

I confirmed the cluster I am on has ib0 and the nanny is attempting to connect to the ib0 IP address.
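
One way such a check can be done from Python is sketched below; psutil is an assumption here and is not part of the environment reported in the issue.

import socket
import psutil  # assumed to be installed; only used for this check

# Print every network interface on the node with its IPv4 addresses.
# 'ib0' should appear in the list, and its address should match the one
# the nanny reports it is starting on.
for name, addrs in psutil.net_if_addrs().items():
    ipv4 = [a.address for a in addrs if a.family == socket.AF_INET]
    print(name, ipv4)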

I have also confirmed that the job headers written by the cluster work: a simple Python script with the same job headers runs fine when submitted to sbatch.
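
For reference, dask-jobqueue can print the job script it generates, which makes it easy to compare against a hand-written sbatch script. A minimal sketch, reusing the cluster object defined above:

# job_script() returns the generated submission script, including the
# #SBATCH headers and the dask-worker command line that will be run.
print(cluster.job_script())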

The output file from the job is as follows:

distributed.nanny - INFO -         Start Nanny at: 'tcp://192.168.206.35:41398'
distributed.dashboard.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
distributed.worker - INFO -       Start worker at: tcp://192.168.206.35:38225
distributed.worker - INFO -          Listening to: tcp://192.168.206.35:38225
distributed.worker - INFO -          dashboard at:       192.168.206.35:39122
distributed.worker - INFO - Waiting to connect to: tcp://192.168.192.54:38345
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          8
distributed.worker - INFO -                Memory:                    2.00 GB
distributed.worker - INFO -       Local Directory: /home1/06729/tg860286/tests/dask-rsmas-presentation/dask-worker-space/worker-37h141zz
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Waiting to connect to: tcp://192.168.192.54:38345
distributed.worker - INFO - Waiting to connect to: tcp://192.168.192.54:38345
distributed.worker - INFO - Waiting to connect to: tcp://192.168.192.54:38345
distributed.worker - INFO - Waiting to connect to: tcp://192.168.192.54:38345
distributed.nanny - INFO - Closing Nanny at 'tcp://192.168.206.35:41398'
distributed.worker - INFO - Stopping worker at tcp://192.168.206.35:38225
distributed.worker - INFO - Closed worker has not yet started: None
distributed.dask_worker - INFO - End worker
Traceback (most recent call last):
  File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/site-packages/distributed/node.py", line 173, in wait_for
    await asyncio.wait_for(future, timeout=timeout)
  File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/asyncio/tasks.py", line 490, in wait_for
    raise exceptions.TimeoutError()
asyncio.exceptions.TimeoutError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/runpy.py", line 193, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/site-packages/distributed/cli/dask_worker.py", line 440, in <module>
    go()
  File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/site-packages/distributed/cli/dask_worker.py", line 436, in go
    main()
  File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/site-packages/distributed/cli/dask_worker.py", line 422, in main
    loop.run_sync(run)
  File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/site-packages/tornado/ioloop.py", line 532, in run_sync
    return future_cell[0].result()
  File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/site-packages/distributed/cli/dask_worker.py", line 416, in run
    await asyncio.gather(*nannies)
  File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/asyncio/tasks.py", line 684, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/home1/06729/tg860286/miniconda3/envs/daskbase/lib/python3.8/site-packages/distributed/node.py", line 176, in wait_for
    raise TimeoutError(
asyncio.exceptions.TimeoutError: Nanny failed to start in 60 seconds

I can find no additional documentation of the potential causes of this error, so any help would be appreciated.

Relevant Packages:

dask                      2.11.0                     py_0    conda-forge
dask-core                 2.11.0                     py_0    conda-forge
dask-jobqueue             0.7.0                      py_0    conda-forge

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 24 (12 by maintainers)

Top GitHub Comments

1 reaction
lesteve commented, Jun 9, 2020

So for completeness: after further investigation, @lukeconibear’s issue was that his interactive job was killed because it was using too many resources. In his SGE setup, qacct -j <his_job_id> was useful to troubleshoot. I am guessing that something similar exists for non-SGE clusters.
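
For a SLURM setup like the one in the original question, the closest equivalent to qacct is probably sacct. Below is a minimal sketch of that check from Python; the job ID is a placeholder, and the exact accounting fields available depend on the site's Slurm configuration.

import subprocess

job_id = "123456"  # placeholder: the SLURM job ID of the failed worker job

# State, ExitCode and MaxRSS usually show whether the job was killed for
# exceeding its requested resources.
result = subprocess.run(
    ["sacct", "-j", job_id, "--format=JobID,State,ExitCode,MaxRSS,Elapsed"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)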

0 reactions
lesteve commented, Jun 8, 2020

Thanks for your feedback. I suggest we move to https://github.com/dask/distributed/issues/3857 to discuss this issue, as there may be more people with useful feedback there.

Read more comments on GitHub >

Top Results From Across the Web

dask.distributed SLURM cluster Nanny Timeout - Stack Overflow
TimeoutError: Nanny failed to start in 60 seconds. How do I get the nanny to connect? Full Trace: distributed.nanny - INFO...
Dask cluster with large number of workers gives "asyncio ...
The scheduler is first initialized, then the dask-workers join by different ... TimeoutError: Nanny failed to start in 240 seconds.
Source code for distributed.nanny - Dask documentation
TimeoutError: logger.error( "Timed out connecting Nanny '%s' to scheduler '%s'", self, self.scheduler_addr, ) await self.close( timeout=self.death_timeout, ...
