Better handling of tasks that cause the worker to exit

See original GitHub issue

The following seems to hang forever:

>>> from distributed import Executor
>>> import sys
>>> e = Executor('localhost:8786')
>>> e.submit(sys.exit, 1).result()

In the background, this causes the dscheduler logs to output many errors such as:

distributed.scheduler - ERROR - Stream is closed
Traceback (most recent call last):
  File "/home/ogrisel/code/distributed/distributed/scheduler.py", line 1545, in heartbeat
    io_loop=self.loop)
  File "/home/ogrisel/.virtualenvs/py35/lib/python3.5/site-packages/tornado/gen.py", line 1008, in run
    value = future.result()
  File "/home/ogrisel/.virtualenvs/py35/lib/python3.5/site-packages/tornado/concurrent.py", line 232, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/home/ogrisel/.virtualenvs/py35/lib/python3.5/site-packages/tornado/gen.py", line 1014, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/ogrisel/code/distributed/distributed/core.py", line 439, in send_recv_from_rpc
    result = yield send_recv(stream=stream, op=key, **kwargs)
  File "/home/ogrisel/.virtualenvs/py35/lib/python3.5/site-packages/tornado/gen.py", line 1008, in run
    value = future.result()
  File "/home/ogrisel/.virtualenvs/py35/lib/python3.5/site-packages/tornado/concurrent.py", line 232, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/home/ogrisel/.virtualenvs/py35/lib/python3.5/site-packages/tornado/gen.py", line 1014, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/ogrisel/code/distributed/distributed/core.py", line 333, in send_recv
    response = yield read(stream)
  File "/home/ogrisel/.virtualenvs/py35/lib/python3.5/site-packages/tornado/gen.py", line 1008, in run
    value = future.result()
  File "/home/ogrisel/.virtualenvs/py35/lib/python3.5/site-packages/tornado/concurrent.py", line 232, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/home/ogrisel/.virtualenvs/py35/lib/python3.5/site-packages/tornado/gen.py", line 1014, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/ogrisel/code/distributed/distributed/core.py", line 241, in read
    n_frames = yield stream.read_bytes(8)
  File "/home/ogrisel/.virtualenvs/py35/lib/python3.5/site-packages/tornado/gen.py", line 1008, in run
    value = future.result()
  File "/home/ogrisel/.virtualenvs/py35/lib/python3.5/site-packages/tornado/concurrent.py", line 232, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
tornado.iostream.StreamClosedError: Stream is closed
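
As a client-side stopgap, passing a timeout to the blocking result() call at least turns the indefinite hang into an exception. A minimal sketch, assuming Future.result accepts a timeout keyword (it does in current distributed releases; older versions may differ, and the 10-second value is arbitrary):

>>> from distributed import Executor
>>> import sys
>>> e = Executor('localhost:8786')
>>> future = e.submit(sys.exit, 1)
>>> try:
...     future.result(timeout=10)  # raises instead of blocking forever
... except Exception as exc:
...     print('task did not complete:', exc)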

I think each worker process should have a unique uuid4 that is passed as metadata to any Future object returned by that worker. The nannies should report to the scheduler the uuid4 of any process that crashed while executing a task, so that the scheduler can ban those tasks and mark the corresponding Future objects as failed with an informative error message instead of hanging forever.
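
A rough sketch of what that bookkeeping could look like on the scheduler side. This is purely illustrative: the class names, the nanny callback, and the plain concurrent.futures futures are hypothetical stand-ins, not distributed's actual internals:

import uuid
from concurrent.futures import Future

class WorkerProcessRecord:
    """Hypothetical record: one uuid4 per worker process,
    regenerated by the nanny whenever it restarts the process."""
    def __init__(self, address):
        self.address = address
        self.process_id = uuid.uuid4().hex

class CrashTracker:
    """Hypothetical scheduler-side bookkeeping for the proposal above."""
    def __init__(self):
        self.banned_tasks = set()   # task keys that took a worker process down
        self.futures = {}           # task key -> client-facing future

    def on_nanny_report(self, crashed_process_id, running_task_keys):
        # The nanny reports which process died and which task keys it was
        # running; mark those tasks as failed instead of letting clients hang.
        for key in running_task_keys:
            self.banned_tasks.add(key)
            fut = self.futures.get(key)
            if fut is not None and not fut.done():
                fut.set_exception(RuntimeError(
                    "task %r caused worker process %s to exit"
                    % (key, crashed_process_id)))

Banning the key is what prevents the scheduler from resubmitting the same task to the next worker it would otherwise kill.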

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

1 reaction
minrk commented, Jun 4, 2016

IPython will, yes, but I’m not sure it should. Engine death is the only case I have encountered in the real world where I have seen a task that should be retried, so handling only that makes the most sense to me, at least as a starting point. I thought some things like temporary resource availability issues might come up, but they don’t seem to, so I would probably hold off on supporting that until someone shows up with a real need for it (and a PR!).

0 reactions
mrocklin commented, Jun 3, 2016

@minrk if retries is given and the exception is more mundane, like ZeroDivisionError, will IPyParallel still retry?
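
The distinction being discussed here is between a task whose worker died and a task that simply raised. A library-agnostic sketch of that policy (the WorkerDied exception and run_with_retries helper are hypothetical, not IPyParallel's or distributed's actual retry machinery):

class WorkerDied(Exception):
    """Hypothetical signal that the worker process exited mid-task."""

def run_with_retries(submit, func, *args, max_retries=3):
    # Retry only on worker death; any other exception (e.g. ZeroDivisionError)
    # is a genuine task failure and propagates to the caller immediately.
    for attempt in range(max_retries + 1):
        try:
            return submit(func, *args)
        except WorkerDied:
            if attempt == max_retries:
                raise

# Example with a local "submit" that never loses its worker:
if __name__ == '__main__':
    local_submit = lambda f, *a: f(*a)
    print(run_with_retries(local_submit, lambda x: x + 1, 41))   # prints 42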

Read more comments on GitHub.
