KeyError: ('error', 'waiting')
What happened:
Sometimes the worker logs indicate the following KeyError: ('error', 'waiting'):
ERROR:tornado.application:Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x0000000003C90C88>>, <Task finished coro=<Worker.handle_scheduler() done, defined at \miniconda3\lib\site-packages\distributed\worker.py:997> exception=KeyError(('error', 'waiting'))>)
Traceback (most recent call last):
File "miniconda3\lib\site-packages\tornado\ioloop.py", line 741, in _run_callback
ret = callback()
File "miniconda3\lib\site-packages\tornado\ioloop.py", line 765, in _discard_future_result
future.result()
File "\miniconda3\lib\site-packages\distributed\worker.py", line 1000, in handle_scheduler
comm, every_cycle=[self.ensure_communicating, self.ensure_computing]
File "miniconda3\lib\site-packages\distributed\core.py", line 573, in handle_stream
handler(**merge(extra, msg))
File "\miniconda3\lib\site-packages\distributed\worker.py", line 1502, in add_task
self.transition(ts, "waiting", runspec=runspec)
File "\miniconda3\lib\site-packages\distributed\worker.py", line 1602, in transition
func = self._transitions[start, finish]
KeyError: ('error', 'waiting')
This appears right before the worker seems to lose its connection; eventually the TTL lapses and the worker dies. At the same time, the scheduler logs show the following:
ERROR - 2021-05-08 02:18:25 - distributed.utils.log_errors.l673 - 'TASK_KEY'
Traceback (most recent call last):
File "\miniconda3\lib\site-packages\distributed\utils.py", line 668, in log_errors
yield
File "\miniconda3\lib\site-packages\distributed\scheduler.py", line 3986, in add_worker
typename=types[key],
KeyError: 'TASK_KEY'
ERROR - 2021-05-08 02:18:25 - distributed.core.handle_comm.l507 - Exception while handling op register-worker
Traceback (most recent call last):
File "\miniconda3\lib\site-packages\distributed\core.py", line 501, in handle_comm
result = await result
File "\miniconda3\lib\site-packages\distributed\scheduler.py", line 3986, in add_worker
typename=types[key],
KeyError: 'TASK_KEY'
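For context, the worker dispatches task-state changes through a table keyed by (start, finish) pairs, and `func = self._transitions[start, finish]` fails when no handler is registered for that pair. A minimal, purely illustrative sketch of that pattern (the class and handler names are invented, not the actual distributed source):

```python
# Minimal sketch of a transition-table lookup, purely illustrative --
# not the actual distributed.worker source.
class ToyWorker:
    def __init__(self):
        # Only registered (start, finish) pairs have handlers.
        self._transitions = {
            ("waiting", "ready"): self._waiting_to_ready,
            ("ready", "executing"): self._ready_to_executing,
            # Note: no ("error", "waiting") entry.
        }

    def _waiting_to_ready(self, ts):
        ts["state"] = "ready"

    def _ready_to_executing(self, ts):
        ts["state"] = "executing"

    def transition(self, ts, finish):
        start = ts["state"]
        # Raises KeyError((start, finish)) when the pair is unregistered,
        # the same shape as KeyError: ('error', 'waiting') above.
        func = self._transitions[start, finish]
        func(ts)


ts = {"state": "error"}
ToyWorker().transition(ts, "waiting")  # KeyError: ('error', 'waiting')
```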
I’m not totally sure how to handle this, so I’ve put in some fixes on our side that handle errors more generally. However, when I look at the worker source code, I see this line: https://github.com/dask/distributed/blob/0a014dac4a1edb090ee17027fabcd41cdd015553/distributed/worker.py#L1496 which handles the case where the status is set to "erred", but not "error". Could this be an oversight, or is there something else I’m missing? I’m not clear on whether that branch will properly transition the task; if it did, perhaps the scheduler types lookup above would not fail. Or should the scheduler type lookup itself handle these KeyErrors?
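To make the question concrete, this is the kind of normalization I mean. It is a hypothetical sketch only (the function name and task-state dict are invented), not a proposed patch to distributed:

```python
# Hypothetical sketch only -- NOT code from distributed. It just shows
# the kind of normalization I am asking about: treating both spellings
# of the error state the same way before re-queueing the task.
ERROR_STATES = {"erred", "error"}

def prepare_for_resubmission(ts: dict) -> dict:
    """Reset failure bookkeeping so the task can go back to 'waiting',
    regardless of which error spelling the state currently uses."""
    if ts.get("state") in ERROR_STATES:
        ts["exception"] = None
        ts["traceback"] = None
        ts["state"] = "erred"  # pick one canonical spelling
    return ts

print(prepare_for_resubmission({"state": "error"}))
# -> {'state': 'erred', 'exception': None, 'traceback': None}
```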
Anything else we need to know?: I’m sorry I once again do not have a minimal working example; I hope this is enough to go on.
Environment:
- Dask version: 2021.04.1
- Python version: 3.7
- Operating System: Windows
- Install method (conda, pip, source): conda
Top GitHub Comments
Got it. After removing some of our “strategic” errors (exceptions raised deliberately to supersede the need to schedule downstream tasks), these issues stopped occurring, so that does indeed appear to be the root cause. Hope this helps, and thanks for looking into this!
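For anyone unfamiliar with the pattern, a “strategic” error here just means a task that raises on purpose so its dependents never execute. A rough, illustrative sketch (the function names and the in-process cluster are made up for the example):

```python
# Rough illustration of the "strategic error" pattern mentioned above;
# the function names and local in-process cluster are for the example only.
from dask.distributed import Client, wait

def maybe_skip(x):
    if x is None:
        # Intentional failure used to short-circuit downstream work.
        raise ValueError("nothing to do, skip downstream tasks")
    return x * 2

def downstream(y):
    return y + 1

if __name__ == "__main__":
    client = Client(processes=False)       # in-process cluster for the demo
    fut = client.submit(maybe_skip, None)  # raises on purpose
    dep = client.submit(downstream, fut)   # never runs; inherits the error
    wait(dep)                              # block until the chain resolves
    print(dep.status)                      # "error" -- the failure propagated
    client.close()
```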
For reference, the above KeyError will be fixed in https://github.com/dask/distributed/pull/5103 and was reported in https://github.com/dask/distributed/issues/5078 as a source of a deadlock.