
SGE timeout issue

I’m having some problems when trying to run jobs using dask_jobqueue. I can’t really figure out what piece I’m missing.

I’m trying to run this:

from dask_jobqueue import SGECluster
from distributed import Client
from dask import delayed
import time

@delayed
def example_function(thingy):
    time.sleep(2)
    print('Example')
    return thingy

if __name__ == '__main__':
    cluster = SGECluster(queue='myq',
                         project='myproj',
                         processes=1,
                         cores=1,
                         memory='4GB',
                         death_timeout=100,
                         resource_spec='m_mem_free=4G')
    print(cluster.job_script())
    cluster.start_workers(1)
    i = 2
    future = example_function(i)
    client = Client(cluster)
    client.compute(future)
    print('Finished')

The job is submitted properly, but the worker times out. It seems the workers can’t connect to the scheduler:

(dasktest) ono@host01:~/git/dasktest$ cat dask-worker.o758174
distributed.nanny - INFO -         Start Nanny at: 'tcp://56.172.568.215:43322'
distributed.worker - INFO -       Start worker at: tcp://56.172.568.215:37385
distributed.worker - INFO -          Listening to: tcp://56.172.568.215:37385
distributed.worker - INFO -          dashboard at:       56.172.568.215:36952
distributed.worker - INFO - Waiting to connect to:  tcp://56.142.76.231:42401
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                    4.00 GB
distributed.worker - INFO -       Local Directory: /home/DOM/ono/git/dasktest/worker-mxsadt3h
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Waiting to connect to:  tcp://56.142.76.231:42401
distributed.nanny - INFO - Closing Nanny at 'tcp://56.172.568.215:43322'
distributed.worker - INFO - Stopping worker at tcp://56.172.568.215:37385
distributed.worker - INFO - Closed worker has not yet started: None
distributed.nanny - ERROR - Timed out connecting Nanny '<Nanny: None, threads: 1>' to scheduler 'tcp://56.142.76.231:42401'
Traceback (most recent call last):
  File "/home/DOM/ono/miniconda3/envs/dasktest/lib/python3.7/site-packages/distributed/nanny.py", line 309, in instantiate
    timedelta(seconds=self.death_timeout), self.process.start()
tornado.util.TimeoutError: Timeout
distributed.dask_worker - INFO - End worker
Traceback (most recent call last):
  File "/home/DOM/ono/miniconda3/envs/dasktest/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/DOM/ono/miniconda3/envs/dasktest/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/DOM/ono/miniconda3/envs/dasktest/lib/python3.7/site-packages/distributed/cli/dask_worker.py", line 456, in <module>
    go()
  File "/home/DOM/ono/miniconda3/envs/dasktest/lib/python3.7/site-packages/distributed/cli/dask_worker.py", line 406, in go
    main()
  File "/home/DOM/ono/miniconda3/envs/dasktest/lib/python3.7/site-packages/click/core.py", line 764, in __call__


    return self.main(*args, **kwargs)
  File "/home/DOM/ono/miniconda3/envs/dasktest/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/DOM/ono/miniconda3/envs/dasktest/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/DOM/ono/miniconda3/envs/dasktest/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/DOM/ono/miniconda3/envs/dasktest/lib/python3.7/site-packages/distributed/cli/dask_worker.py", line 397, in main
    raise TimeoutError("Timed out starting worker.") from None
tornado.util.TimeoutError: Timed out starting worker.

I’m not entirely sure whether I need to submit a separate job to start a scheduler or whether that’s handled internally. I was assuming it was handled internally.

What am I missing?

Many thanks!

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 9 (7 by maintainers)

Top GitHub Comments

1 reaction
lesteve commented, Aug 7, 2019

If this is not too complicated, it’d be great if we could transfer this issue to the dask/dask-jobqueue repo (see this for more details).

If I understand correctly, this needs someone with admin rights on both distributed and dask-jobqueue.

1 reaction
guillaumeeb commented, Aug 7, 2019

This is definitely a dask-jobqueue issue.

Job submission is handled internally. You seem to have a network problem: the compute nodes where your workers run cannot reach the scheduler address.

You need to use either the interface kwarg, which specifies the network interface used by both the scheduler host (the host where you start your main Python script) and the worker hosts (your compute nodes), or the ip kwarg to give the IP address or hostname the scheduler should bind to. The interface is often something like ib0 if your cluster uses an InfiniBand network. You may need to use both kwargs. A minimal sketch is shown below.
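
For example, here is the cluster setup from the question with the interface kwarg added. This is only a sketch: ib0 is an assumption, so substitute an interface name that is actually shared between your scheduler host and the compute nodes.

from dask_jobqueue import SGECluster

# Sketch: route scheduler/worker traffic over a shared network interface.
# 'ib0' is an assumption -- list the interfaces on both the login and
# compute nodes (e.g. with `ip addr`) and pick one visible from both.
cluster = SGECluster(queue='myq',
                     project='myproj',
                     processes=1,
                     cores=1,
                     memory='4GB',
                     death_timeout=100,
                     resource_spec='m_mem_free=4G',
                     interface='ib0')

If no single interface is shared, the ip kwarg can instead bind the scheduler to an explicit address. Either way, the “Waiting to connect to:” line in the worker log above shows which address the workers are actually trying to reach, which is a quick way to verify the fix.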

Another solution is to start the scheduler, and therefore your script, from a compute node, for example in an interactive job.
