KeyError: ('error', 'waiting')
What happened:
Sometimes the worker logs indicate the following KeyError: ('error', 'waiting'):
ERROR:tornado.application:Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x0000000003C90C88>>, <Task finished coro=<Worker.handle_scheduler() done, defined at \miniconda3\lib\site-packages\distributed\worker.py:997> exception=KeyError(('error', 'waiting'))>)
Traceback (most recent call last):
File "miniconda3\lib\site-packages\tornado\ioloop.py", line 741, in _run_callback
ret = callback()
File "miniconda3\lib\site-packages\tornado\ioloop.py", line 765, in _discard_future_result
future.result()
File "\miniconda3\lib\site-packages\distributed\worker.py", line 1000, in handle_scheduler
comm, every_cycle=[self.ensure_communicating, self.ensure_computing]
File "miniconda3\lib\site-packages\distributed\core.py", line 573, in handle_stream
handler(**merge(extra, msg))
File "\miniconda3\lib\site-packages\distributed\worker.py", line 1502, in add_task
self.transition(ts, "waiting", runspec=runspec)
File "\miniconda3\lib\site-packages\distributed\worker.py", line 1602, in transition
func = self._transitions[start, finish]
KeyError: ('error', 'waiting')
This appears right before the worker seems to lose its connection; eventually the TTL lapses and the worker dies. At the same time, the scheduler logs show the following:
ERROR - 2021-05-08 02:18:25 - distributed.utils.log_errors.l673 - 'TASK_KEY'
Traceback (most recent call last):
File "\miniconda3\lib\site-packages\distributed\utils.py", line 668, in log_errors
yield
File "\miniconda3\lib\site-packages\distributed\scheduler.py", line 3986, in add_worker
typename=types[key],
KeyError: 'TASK_KEY'
ERROR - 2021-05-08 02:18:25 - distributed.core.handle_comm.l507 - Exception while handling op register-worker
Traceback (most recent call last):
File "\miniconda3\lib\site-packages\distributed\core.py", line 501, in handle_comm
result = await result
File "\miniconda3\lib\site-packages\distributed\scheduler.py", line 3986, in add_worker
typename=types[key],
KeyError: 'TASK_KEY'
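For context, the worker dispatches task-state changes through a table keyed by (start, finish) pairs, and `func = self._transitions[start, finish]` fails when no handler is registered for that pair. A minimal, purely illustrative sketch of that pattern (the class and handler names are invented, not the actual distributed source):

```python
# Minimal sketch of a transition-table lookup, purely illustrative --
# not the actual distributed.worker source.
class ToyWorker:
    def __init__(self):
        # Only registered (start, finish) pairs have handlers.
        self._transitions = {
            ("waiting", "ready"): self._waiting_to_ready,
            ("ready", "executing"): self._ready_to_executing,
            # Note: no ("error", "waiting") entry.
        }

    def _waiting_to_ready(self, ts):
        ts["state"] = "ready"

    def _ready_to_executing(self, ts):
        ts["state"] = "executing"

    def transition(self, ts, finish):
        start = ts["state"]
        # Raises KeyError((start, finish)) when the pair is unregistered,
        # the same shape as KeyError: ('error', 'waiting') above.
        func = self._transitions[start, finish]
        func(ts)


ts = {"state": "error"}
ToyWorker().transition(ts, "waiting")  # KeyError: ('error', 'waiting')
```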
I’m not totally sure how to handle this, so I’ve put in some fixes on our side that handle errors more generally. However, when I look at the worker source code, I see this line: https://github.com/dask/distributed/blob/0a014dac4a1edb090ee17027fabcd41cdd015553/distributed/worker.py#L1496 which handles the case where the status is set to "erred", but not "error". Could this be an oversight, or is there something else I’m missing? I’m not clear on whether that branch will properly transition the task; if it did, perhaps the scheduler types lookup above would not fail. Or should the scheduler type lookup itself handle these KeyErrors?
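To make the question concrete, this is the kind of normalization I mean. It is a hypothetical sketch only (the function name and task-state dict are invented), not a proposed patch to distributed:

```python
# Hypothetical sketch only -- NOT code from distributed. It just shows
# the kind of normalization I am asking about: treating both spellings
# of the error state the same way before re-queueing the task.
ERROR_STATES = {"erred", "error"}

def prepare_for_resubmission(ts: dict) -> dict:
    """Reset failure bookkeeping so the task can go back to 'waiting',
    regardless of which error spelling the state currently uses."""
    if ts.get("state") in ERROR_STATES:
        ts["exception"] = None
        ts["traceback"] = None
        ts["state"] = "erred"  # pick one canonical spelling
    return ts

print(prepare_for_resubmission({"state": "error"}))
# -> {'state': 'erred', 'exception': None, 'traceback': None}
```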
Anything else we need to know?: I’m sorry I once again do not have a minimal working example; I hope this is enough to go on.
Environment:
- Dask version: 2021.04.1
- Python version: 3.7
- Operating System: Windows
- Install method (conda, pip, source): conda
Top GitHub Comments
Got it. After removing some of our “strategic” errors (exceptions raised deliberately to supersede the need to schedule downstream tasks), these issues stopped occurring, so that does indeed appear to be the root cause. Hope this helps, and thanks for looking into this!
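For anyone unfamiliar with the pattern, a “strategic” error here just means a task that raises on purpose so its dependents never execute. A rough, illustrative sketch (the function names and the in-process cluster are made up for the example):

```python
# Rough illustration of the "strategic error" pattern mentioned above;
# the function names and local in-process cluster are for the example only.
from dask.distributed import Client, wait

def maybe_skip(x):
    if x is None:
        # Intentional failure used to short-circuit downstream work.
        raise ValueError("nothing to do, skip downstream tasks")
    return x * 2

def downstream(y):
    return y + 1

if __name__ == "__main__":
    client = Client(processes=False)       # in-process cluster for the demo
    fut = client.submit(maybe_skip, None)  # raises on purpose
    dep = client.submit(downstream, fut)   # never runs; inherits the error
    wait(dep)                              # block until the chain resolves
    print(dep.status)                      # "error" -- the failure propagated
    client.close()
```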
For reference, the above KeyError will be fixed in https://github.com/dask/distributed/pull/5103 and was reported in https://github.com/dask/distributed/issues/5078 as a source of a deadlock.