Better handling of tasks that cause the worker to exit
The following seems to hang forever:
>>> from distributed import Executor
>>> import sys
>>> e = Executor('localhost:8786')
>>> e.submit(sys.exit, 1).result()
In the background, this causes the dscheduler logs to output many errors such as:
distributed.scheduler - ERROR - Stream is closed
Traceback (most recent call last):
File "/home/ogrisel/code/distributed/distributed/scheduler.py", line 1545, in heartbeat
io_loop=self.loop)
File "/home/ogrisel/.virtualenvs/py35/lib/python3.5/site-packages/tornado/gen.py", line 1008, in run
value = future.result()
File "/home/ogrisel/.virtualenvs/py35/lib/python3.5/site-packages/tornado/concurrent.py", line 232, in result
raise_exc_info(self._exc_info)
File "<string>", line 3, in raise_exc_info
File "/home/ogrisel/.virtualenvs/py35/lib/python3.5/site-packages/tornado/gen.py", line 1014, in run
yielded = self.gen.throw(*exc_info)
File "/home/ogrisel/code/distributed/distributed/core.py", line 439, in send_recv_from_rpc
result = yield send_recv(stream=stream, op=key, **kwargs)
File "/home/ogrisel/.virtualenvs/py35/lib/python3.5/site-packages/tornado/gen.py", line 1008, in run
value = future.result()
File "/home/ogrisel/.virtualenvs/py35/lib/python3.5/site-packages/tornado/concurrent.py", line 232, in result
raise_exc_info(self._exc_info)
File "<string>", line 3, in raise_exc_info
File "/home/ogrisel/.virtualenvs/py35/lib/python3.5/site-packages/tornado/gen.py", line 1014, in run
yielded = self.gen.throw(*exc_info)
File "/home/ogrisel/code/distributed/distributed/core.py", line 333, in send_recv
response = yield read(stream)
File "/home/ogrisel/.virtualenvs/py35/lib/python3.5/site-packages/tornado/gen.py", line 1008, in run
value = future.result()
File "/home/ogrisel/.virtualenvs/py35/lib/python3.5/site-packages/tornado/concurrent.py", line 232, in result
raise_exc_info(self._exc_info)
File "<string>", line 3, in raise_exc_info
File "/home/ogrisel/.virtualenvs/py35/lib/python3.5/site-packages/tornado/gen.py", line 1014, in run
yielded = self.gen.throw(*exc_info)
File "/home/ogrisel/code/distributed/distributed/core.py", line 241, in read
n_frames = yield stream.read_bytes(8)
File "/home/ogrisel/.virtualenvs/py35/lib/python3.5/site-packages/tornado/gen.py", line 1008, in run
value = future.result()
File "/home/ogrisel/.virtualenvs/py35/lib/python3.5/site-packages/tornado/concurrent.py", line 232, in result
raise_exc_info(self._exc_info)
File "<string>", line 3, in raise_exc_info
tornado.iostream.StreamClosedError: Stream is closed
I think each worker process should have a unique uuid4 that is passed as metadata to any Future object returned by that worker. The nannies should report to the scheduler the uuid4 of any process that crashed while executing a task, so that the scheduler can ban those tasks and mark the corresponding Future objects as failed with an informative error message instead of hanging forever.
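To make the proposal concrete, here is a minimal in-memory sketch of the bookkeeping it implies. It is not distributed's actual code: `ToyScheduler`, `WorkerDiedError`, `register_worker`, `assign`, and `report_worker_death` are all hypothetical names, and no real processes or network streams are involved.

```python
import uuid
from concurrent.futures import Future


class WorkerDiedError(Exception):
    """Hypothetical error for a task whose worker process exited mid-run."""


class ToyScheduler:
    """Hypothetical stand-in for the scheduler; tracks tasks per worker id."""

    def __init__(self):
        self.running = {}    # worker uuid -> {task key: Future}
        self.banned = set()  # task keys that killed a worker

    def register_worker(self):
        # Each worker process gets a unique uuid4 when it starts.
        worker_id = uuid.uuid4().hex
        self.running[worker_id] = {}
        return worker_id

    def assign(self, worker_id, key):
        # The worker's uuid travels with the Future as metadata.
        future = Future()
        future.worker_id = worker_id
        self.running[worker_id][key] = future
        return future

    def report_worker_death(self, worker_id, exit_code):
        # Called by the nanny when its worker process crashes: ban the
        # tasks that were running there and fail their Futures.
        for key, future in self.running.pop(worker_id, {}).items():
            self.banned.add(key)
            future.set_exception(WorkerDiedError(
                "worker %s exited with code %r while running %r"
                % (worker_id, exit_code, key)))
```

With this in place, a client calling `.result()` on the Future would see a `WorkerDiedError` as soon as the nanny reports the crash, rather than waiting indefinitely.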
Issue Analytics
- State:
- Created 7 years ago
- Comments:8 (8 by maintainers)
Top GitHub Comments
IPython will, yes, but I’m not sure it should. Engine death is the only case I have encountered in the real world where I have seen a task that should be retried, so handling only that makes the most sense to me, at least as a starting point. I thought some things like temporary resource availability issues might come up, but they don’t seem to, so I would probably hold off on supporting that until someone shows up with a real need for it (and a PR!).
@minrk if retries is given and the exception is more mundane, like ZeroDivisionError, will IPyParallel still retry?
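The narrower policy discussed in this exchange (retry only when the engine or worker died, not on ordinary exceptions) could look roughly like the sketch below. This is an illustration only, not IPyParallel's or distributed's API: `EngineDied` and `run_with_retries` are hypothetical names.

```python
class EngineDied(Exception):
    """Hypothetical signal that the process running the task was lost."""


def run_with_retries(task, retries=0):
    """Call task(); retry up to `retries` times, but only on EngineDied."""
    attempts = 0
    while True:
        try:
            return task()
        except EngineDied:
            # Losing the engine is treated as transient, so try again
            # until the retry budget is spent.
            if attempts >= retries:
                raise
            attempts += 1
        # Any other exception (e.g. ZeroDivisionError) propagates at once,
        # since rerunning a deterministic failure would not help.
```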