
SGE timeout issue

I’m having some problems when trying to run jobs using dask_jobqueue. I can’t really figure out what piece I’m missing.

I’m trying to run this:

from dask_jobqueue import SGECluster
from distributed import Client
from dask import delayed
import time

@delayed
def example_function(thingy):
    time.sleep(2)
    print('Example')
    return thingy

if __name__ == '__main__':
    cluster = SGECluster(queue='myq',
                         project='myproj',
                         processes=1,
                         cores=1,
                         memory='4GB',
                         death_timeout=100,
                         resource_spec='m_mem_free=4G')
    print(cluster.job_script())
    cluster.start_workers(1)
    i = 2
    future = example_function(i)
    client = Client(cluster)
    client.compute(future)
    print('Finished')

The job is submitted properly, but the worker times out. It seems the workers can’t connect to the scheduler:

(dasktest) ono@host01:~/git/dasktest$ cat dask-worker.o758174
distributed.nanny - INFO -         Start Nanny at: 'tcp://56.172.568.215:43322'
distributed.worker - INFO -       Start worker at: tcp://56.172.568.215:37385
distributed.worker - INFO -          Listening to: tcp://56.172.568.215:37385
distributed.worker - INFO -          dashboard at:       56.172.568.215:36952
distributed.worker - INFO - Waiting to connect to:  tcp://56.142.76.231:42401
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                    4.00 GB
distributed.worker - INFO -       Local Directory: /home/DOM/ono/git/dasktest/worker-mxsadt3h
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Waiting to connect to:  tcp://56.142.76.231:42401
distributed.nanny - INFO - Closing Nanny at 'tcp://56.172.568.215:43322'
distributed.worker - INFO - Stopping worker at tcp://56.172.568.215:37385
distributed.worker - INFO - Closed worker has not yet started: None
distributed.nanny - ERROR - Timed out connecting Nanny '<Nanny: None, threads: 1>' to scheduler 'tcp://56.142.76.231:42401'
Traceback (most recent call last):
  File "/home/DOM/ono/miniconda3/envs/dasktest/lib/python3.7/site-packages/distributed/nanny.py", line 309, in instantiate
    timedelta(seconds=self.death_timeout), self.process.start()
tornado.util.TimeoutError: Timeout
distributed.dask_worker - INFO - End worker
Traceback (most recent call last):
  File "/home/DOM/ono/miniconda3/envs/dasktest/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/DOM/ono/miniconda3/envs/dasktest/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/DOM/ono/miniconda3/envs/dasktest/lib/python3.7/site-packages/distributed/cli/dask_worker.py", line 456, in <module>
    go()
  File "/home/DOM/ono/miniconda3/envs/dasktest/lib/python3.7/site-packages/distributed/cli/dask_worker.py", line 406, in go
    main()
  File "/home/DOM/ono/miniconda3/envs/dasktest/lib/python3.7/site-packages/click/core.py", line 764, in __call__


    return self.main(*args, **kwargs)
  File "/home/DOM/ono/miniconda3/envs/dasktest/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/DOM/ono/miniconda3/envs/dasktest/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/DOM/ono/miniconda3/envs/dasktest/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/DOM/ono/miniconda3/envs/dasktest/lib/python3.7/site-packages/distributed/cli/dask_worker.py", line 397, in main
    raise TimeoutError("Timed out starting worker.") from None
tornado.util.TimeoutError: Timed out starting worker.

I’m not entirely sure whether I need to submit a separate job to start a scheduler or whether that’s handled internally. I was assuming it was handled internally.

What am I missing?

Many thanks!

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 9 (7 by maintainers)

Top GitHub Comments

1 reaction
lesteve commented, Aug 7, 2019

If this is not too complicated, it’d be great if we could transfer this issue to the dask/dask-jobqueue repo (see this for more details).

If I understand correctly, this needs someone with admin rights on both distributed and dask-jobqueue.

1 reaction
guillaumeeb commented, Aug 7, 2019

This is definitely a dask-jobqueue issue.

Job submission is handled internally. You seem to have a network problem: the compute nodes where your workers run cannot reach the scheduler address.

You need to use either the interface kwarg, which specifies the network interface used by both the scheduler host (the host where you start your main Python script) and the worker hosts (your compute nodes), or the ip kwarg to give the IP address or hostname the scheduler should bind to. The interface is often something like ib0 if your cluster uses an InfiniBand network. You may need to use both kwargs. A minimal sketch is shown below.
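
For example, here is the cluster setup from the question with the interface kwarg added. This is only a sketch: ib0 is an assumption, so substitute an interface name that is actually shared between your scheduler host and the compute nodes.

from dask_jobqueue import SGECluster

# Sketch: route scheduler/worker traffic over a shared network interface.
# 'ib0' is an assumption -- list the interfaces on both the login and
# compute nodes (e.g. with `ip addr`) and pick one visible from both.
cluster = SGECluster(queue='myq',
                     project='myproj',
                     processes=1,
                     cores=1,
                     memory='4GB',
                     death_timeout=100,
                     resource_spec='m_mem_free=4G',
                     interface='ib0')

If no single interface is shared, the ip kwarg can instead bind the scheduler to an explicit address. Either way, the “Waiting to connect to:” line in the worker log above shows which address the workers are actually trying to reach, which is a quick way to verify the fix.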

Another solution is to start the scheduler, and therefore your script, from a compute node, for example in an interactive job.
