Better handling of tasks that cause the worker to exit

The following seems to hang forever:

>>> from distributed import Executor
>>> import sys
>>> e = Executor('localhost:8786')
>>> e.submit(sys.exit, 1).result()

In the background, this causes the dscheduler logs to fill with errors such as:

distributed.scheduler - ERROR - Stream is closed
Traceback (most recent call last):
  File "/home/ogrisel/code/distributed/distributed/scheduler.py", line 1545, in heartbeat
    io_loop=self.loop)
  File "/home/ogrisel/.virtualenvs/py35/lib/python3.5/site-packages/tornado/gen.py", line 1008, in run
    value = future.result()
  File "/home/ogrisel/.virtualenvs/py35/lib/python3.5/site-packages/tornado/concurrent.py", line 232, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/home/ogrisel/.virtualenvs/py35/lib/python3.5/site-packages/tornado/gen.py", line 1014, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/ogrisel/code/distributed/distributed/core.py", line 439, in send_recv_from_rpc
    result = yield send_recv(stream=stream, op=key, **kwargs)
  File "/home/ogrisel/.virtualenvs/py35/lib/python3.5/site-packages/tornado/gen.py", line 1008, in run
    value = future.result()
  File "/home/ogrisel/.virtualenvs/py35/lib/python3.5/site-packages/tornado/concurrent.py", line 232, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/home/ogrisel/.virtualenvs/py35/lib/python3.5/site-packages/tornado/gen.py", line 1014, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/ogrisel/code/distributed/distributed/core.py", line 333, in send_recv
    response = yield read(stream)
  File "/home/ogrisel/.virtualenvs/py35/lib/python3.5/site-packages/tornado/gen.py", line 1008, in run
    value = future.result()
  File "/home/ogrisel/.virtualenvs/py35/lib/python3.5/site-packages/tornado/concurrent.py", line 232, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/home/ogrisel/.virtualenvs/py35/lib/python3.5/site-packages/tornado/gen.py", line 1014, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/ogrisel/code/distributed/distributed/core.py", line 241, in read
    n_frames = yield stream.read_bytes(8)
  File "/home/ogrisel/.virtualenvs/py35/lib/python3.5/site-packages/tornado/gen.py", line 1008, in run
    value = future.result()
  File "/home/ogrisel/.virtualenvs/py35/lib/python3.5/site-packages/tornado/concurrent.py", line 232, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
tornado.iostream.StreamClosedError: Stream is closed

I think each worker process should have a unique uuid4 that is attached as metadata to every Future object returned by that worker. The nannies should then report to the scheduler the uuid4 of any process that crashed while executing a task, so that the scheduler can ban those tasks and mark the corresponding Future objects as failed with an informative error message instead of hanging forever.
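
The proposal above could be sketched roughly as follows. This is only a toy illustration, not code from distributed: the `Future`, `Scheduler`, and `WorkerDied` classes and the `assign` and `report_crash` methods are hypothetical names made up for this example, and the nanny is reduced to a direct method call.

```python
import uuid


class WorkerDied(Exception):
    """Hypothetical error set on a future whose worker process exited."""


class Future:
    """Toy stand-in for a future; NOT distributed's Future class."""

    def __init__(self, key):
        self.key = key
        self.worker_id = None  # uuid4 of the worker process running the task
        self.status = "pending"
        self.exception = None


class Scheduler:
    """Toy scheduler that remembers which worker process runs which task."""

    def __init__(self):
        self.running = {}  # worker_id -> futures currently executing there

    def assign(self, future, worker_id):
        # Attach the worker's id to the future as metadata.
        future.worker_id = worker_id
        self.running.setdefault(worker_id, []).append(future)

    def report_crash(self, worker_id, exit_code):
        # What a nanny would call once it sees its worker process exit.
        for future in self.running.pop(worker_id, []):
            future.status = "error"
            future.exception = WorkerDied(
                "worker %s exited with code %s while running %r"
                % (worker_id, exit_code, future.key)
            )


scheduler = Scheduler()
worker_id = uuid.uuid4().hex  # one id per worker process
future = Future("sys.exit-1")
scheduler.assign(future, worker_id)
scheduler.report_crash(worker_id, exit_code=1)
print(future.status, future.exception)
```

With this bookkeeping, a client waiting on the future would get an error naming the dead worker rather than blocking indefinitely.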

Issue Analytics

Created 7 years ago
Comments:8 (8 by maintainers)

Top GitHub Comments

minrk commented, Jun 4, 2016

IPython will, yes, but I’m not sure it should. Engine death is the only case I have encountered in the real world where I have seen a task that should be retried, so handling only that makes the most sense to me, at least as a starting point. I thought some things like temporary resource availability issues might come up, but they don’t seem to, so I would probably hold off on supporting that until someone shows up with a real need for it (and a PR!).
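
A policy that retries only on engine death, as suggested above, might look something like this minimal sketch. It is not IPyParallel's or distributed's implementation: `WorkerDied` and `run_with_retries` are hypothetical names introduced for illustration, and the task is a plain local callable.

```python
class WorkerDied(Exception):
    """Hypothetical marker for a task lost because its worker exited."""


def run_with_retries(task, retries=0):
    """Call task(); retry only when it raises WorkerDied.

    Any other exception (ZeroDivisionError, for example) propagates
    immediately, because rerunning the task would fail the same way.
    """
    attempts = 0
    while True:
        try:
            return task()
        except WorkerDied:
            if attempts >= retries:
                raise
            attempts += 1


calls = []


def flaky():
    # Simulates a task whose worker dies on the first two attempts.
    calls.append(1)
    if len(calls) < 3:
        raise WorkerDied("engine died")
    return "done"


result = run_with_retries(flaky, retries=2)
```

Here `flaky` succeeds on its third attempt, while a task raising an ordinary exception would fail on the first call no matter what `retries` is set to.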

mrocklin commented, Jun 3, 2016

@minrk if retries is given and the exception is more mundane, like ZeroDivisionError, will IPyParallel still retry?