SGE timeout issue
See original GitHub issue.

I’m having some problems when trying to run jobs using dask_jobqueue. I can’t really figure out what piece I’m missing.
I’m trying to run this:
from dask_jobqueue import SGECluster
from distributed import Client
from dask import delayed
import time

@delayed
def example_function(thingy):
    time.sleep(2)
    print('Example')
    return thingy

if __name__ == '__main__':
    cluster = SGECluster(queue='myq',
                         project='myproj',
                         processes=1,
                         cores=1,
                         memory='4GB',
                         death_timeout=100,
                         resource_spec='m_mem_free=4G')
    print(cluster.job_script())
    cluster.start_workers(1)

    i = 2
    future = example_function(i)
    client = Client(cluster)
    client.compute(future)
    print('Finished')
The job is submitted properly, but it times out. It seems the workers can’t connect to the scheduler:
(dasktest) ono@host01:~/git/dasktest$ cat dask-worker.o758174
distributed.nanny - INFO - Start Nanny at: 'tcp://56.172.568.215:43322'
distributed.worker - INFO - Start worker at: tcp://56.172.568.215:37385
distributed.worker - INFO - Listening to: tcp://56.172.568.215:37385
distributed.worker - INFO - dashboard at: 56.172.568.215:36952
distributed.worker - INFO - Waiting to connect to: tcp://56.142.76.231:42401
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 1
distributed.worker - INFO - Memory: 4.00 GB
distributed.worker - INFO - Local Directory: /home/DOM/ono/git/dasktest/worker-mxsadt3h
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Waiting to connect to: tcp://56.142.76.231:42401
distributed.nanny - INFO - Closing Nanny at 'tcp://56.172.568.215:43322'
distributed.worker - INFO - Stopping worker at tcp://56.172.568.215:37385
distributed.worker - INFO - Closed worker has not yet started: None
distributed.nanny - ERROR - Timed out connecting Nanny '<Nanny: None, threads: 1>' to scheduler 'tcp://56.142.76.231:42401'
Traceback (most recent call last):
File "/home/DOM/ono/miniconda3/envs/dasktest/lib/python3.7/site-packages/distributed/nanny.py", line 309, in instantiate
timedelta(seconds=self.death_timeout), self.process.start()
tornado.util.TimeoutError: Timeout
distributed.dask_worker - INFO - End worker
Traceback (most recent call last):
File "/home/DOM/ono/miniconda3/envs/dasktest/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/DOM/ono/miniconda3/envs/dasktest/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/DOM/ono/miniconda3/envs/dasktest/lib/python3.7/site-packages/distributed/cli/dask_worker.py", line 456, in <module>
go()
File "/home/DOM/ono/miniconda3/envs/dasktest/lib/python3.7/site-packages/distributed/cli/dask_worker.py", line 406, in go
main()
File "/home/DOM/ono/miniconda3/envs/dasktest/lib/python3.7/site-packages/click/core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "/home/DOM/ono/miniconda3/envs/dasktest/lib/python3.7/site-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/home/DOM/ono/miniconda3/envs/dasktest/lib/python3.7/site-packages/click/core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/DOM/ono/miniconda3/envs/dasktest/lib/python3.7/site-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "/home/DOM/ono/miniconda3/envs/dasktest/lib/python3.7/site-packages/distributed/cli/dask_worker.py", line 397, in main
raise TimeoutError("Timed out starting worker.") from None
tornado.util.TimeoutError: Timed out starting worker.
I guess I’m not completely clear on whether I need to submit an independent job to start a scheduler, or whether that’s handled internally. I was assuming it was handled internally.
What am I missing?
Many thanks!
Issue Analytics
- Created 4 years ago
- Comments: 9 (7 by maintainers)
Top GitHub Comments
If this is not too complicated, it’d be great if we could transfer this issue to the dask/dask-jobqueue repo (see this for more details). If I understand correctly, we need someone with admin rights on both distributed and dask-jobqueue.

This is definitely a dask-jobqueue issue.
Job submission is handled internally. You seem to have a network problem: the compute nodes where your workers are running are not able to see the Scheduler address.

You need to use either the interface kwarg, which lets you specify the network interface to be used by both the Scheduler host (the host where you start your main Python script) and the worker hosts (your compute nodes), or the ip kwarg to specify the IP or hostname the Scheduler host should bind to. interface is often something like ib0 if your cluster uses an InfiniBand network. You may need to use both kwargs.

Another solution is to start the Scheduler, and therefore your script, from a compute node, for example in an interactive job.
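
For illustration, here is a minimal sketch of the interface suggestion above. The interface name ib0 is an assumption; substitute one that actually exists on both the Scheduler host and the compute nodes (you can list them with a tool such as ip addr):

from dask_jobqueue import SGECluster

# Bind both the Scheduler and the workers to the InfiniBand interface.
# 'ib0' is an assumption; use whatever interface name your cluster
# actually provides ('eth0', 'ib0', ...).
cluster = SGECluster(queue='myq',
                     project='myproj',
                     processes=1,
                     cores=1,
                     memory='4GB',
                     resource_spec='m_mem_free=4G',
                     interface='ib0')

If only the Scheduler side needs pinning, the ip kwarg mentioned above takes the address or hostname it should bind to instead. And for the last suggestion, running the whole script from a compute node inside an interactive job (e.g. via qrsh) avoids the network mismatch entirely.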