
KeyError: ('error', 'waiting')

What happened: Sometimes the worker logs indicate the following KeyError: ('error', 'waiting'):

ERROR:tornado.application:Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x0000000003C90C88>>, <Task finished coro=<Worker.handle_scheduler() done, defined at \miniconda3\lib\site-packages\distributed\worker.py:997> exception=KeyError(('error', 'waiting'))>)
Traceback (most recent call last):
  File "miniconda3\lib\site-packages\tornado\ioloop.py", line 741, in _run_callback
    ret = callback()
  File "miniconda3\lib\site-packages\tornado\ioloop.py", line 765, in _discard_future_result
    future.result()
  File "\miniconda3\lib\site-packages\distributed\worker.py", line 1000, in handle_scheduler
    comm, every_cycle=[self.ensure_communicating, self.ensure_computing]
  File "miniconda3\lib\site-packages\distributed\core.py", line 573, in handle_stream
    handler(**merge(extra, msg))
  File "\miniconda3\lib\site-packages\distributed\worker.py", line 1502, in add_task
    self.transition(ts, "waiting", runspec=runspec)
  File "\miniconda3\lib\site-packages\distributed\worker.py", line 1602, in transition
    func = self._transitions[start, finish]
KeyError: ('error', 'waiting')

This appears right before the worker seems to lose its connection; eventually the TTL lapses and the worker dies. At the same time, the scheduler logs show the following:

ERROR - 2021-05-08 02:18:25 - distributed.utils.log_errors.l673 - 'TASK_KEY'
Traceback (most recent call last):
  File "\miniconda3\lib\site-packages\distributed\utils.py", line 668, in log_errors
    yield
  File "\miniconda3\lib\site-packages\distributed\scheduler.py", line 3986, in add_worker
    typename=types[key],
KeyError: 'TASK_KEY'
ERROR - 2021-05-08 02:18:25 - distributed.core.handle_comm.l507 - Exception while handling op register-worker
Traceback (most recent call last):
  File "\miniconda3\lib\site-packages\distributed\core.py", line 501, in handle_comm
    result = await result
  File "\miniconda3\lib\site-packages\distributed\scheduler.py", line 3986, in add_worker
    typename=types[key],
KeyError: 'TASK_KEY'
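
For context, both failures boil down to a plain dict lookup on a key that is not present. The following is a minimal, self-contained sketch of the scheduler-side pattern only — the types mapping and the task keys here are hypothetical stand-ins, not the real scheduler state:

# Hypothetical stand-ins: "types" mimics the scheduler's task-key -> typename
# mapping, and "TASK_KEY" stands in for the real (redacted) task key.
types = {"task-a": "pandas.DataFrame"}
keys_reported_by_worker = ["task-a", "TASK_KEY"]

# Raising lookup, as in the traceback above:
try:
    typenames = [types[key] for key in keys_reported_by_worker]
except KeyError as exc:
    print(f"KeyError: {exc}")                  # KeyError: 'TASK_KEY'

# A tolerant variant, *if* missing keys were considered expected at this point:
typenames = [types.get(key) for key in keys_reported_by_worker]
print(typenames)                               # ['pandas.DataFrame', None]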

I’m not totally sure how to handle this, so I’ve put in some fixes where we handle errors more generally. However, when I look at the worker source code, I see this line: https://github.com/dask/distributed/blob/0a014dac4a1edb090ee17027fabcd41cdd015553/distributed/worker.py#L1496, which handles the case where the status is set to “erred” but not “error”. Could this be an oversight, or is there something else I’m missing? I’m not clear on whether handling “error” there would properly transition the task, in which case the scheduler types lookup above might not fail. Or does the scheduler type lookup itself need to handle these key errors?
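
To make the worker-side failure concrete, here is a hedged, self-contained sketch of a transition table keyed by (start, finish) pairs — not Dask’s actual implementation — together with two hypothetical ways the (‘error’, ‘waiting’) gap could be closed: normalizing the status name before the lookup, or registering the missing pair. (Per the maintainer comment below, the real fix landed elsewhere; this only illustrates why the lookup raises.)

# Hedged sketch: a toy transition table keyed by (start, finish) pairs.
# The handlers and the ALIASES mapping are hypothetical, not Dask's code.
transitions = {
    ("erred", "waiting"): lambda task: print(f"{task}: erred -> waiting"),
}

# A direct lookup with the status name the worker actually sent fails:
try:
    transitions["error", "waiting"]("my-task")
except KeyError as exc:
    print(f"KeyError: {exc}")                  # KeyError: ('error', 'waiting')

# Option 1: normalize aliases before the lookup, so "error" resolves to "erred".
ALIASES = {"error": "erred"}

def transition(task, start, finish):
    start = ALIASES.get(start, start)
    finish = ALIASES.get(finish, finish)
    transitions[start, finish](task)

transition("my-task", "error", "waiting")      # the erred -> waiting handler runs

# Option 2: register the pair under the name the worker actually uses.
transitions[("error", "waiting")] = lambda task: print(f"{task}: error -> waiting")
transitions["error", "waiting"]("my-task")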

Anything else we need to know?: I’m sorry I once again do not have a minimal working example. I hope this is enough to go on.

Environment:

  • Dask version: 2021.04.1
  • Python version: 3.7
  • Operating System: Windows
  • Install method (conda, pip, source): conda

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 9 (5 by maintainers)

Top GitHub Comments

1 reaction
mdering commented, May 14, 2021

Got it. After removing some of our “strategic” errors (exceptions raised deliberately so that downstream tasks never need to be scheduled), these issues stopped occurring, so they do indeed appear to be the root cause. Hope this helps, and thanks for looking into it!
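
For readers hitting the same thing, the “strategic” error pattern described above looks roughly like the hedged dask.delayed sketch below (function names are made up): an exception raised in an upstream task is relied on to keep its dependents from ever running, while returning a sentinel and letting the dependent decide avoids that error path entirely.

# Hedged sketch with made-up function names, using dask.delayed and the
# default local scheduler.
from dask import delayed

@delayed
def maybe_skip(x):
    # "Strategic" error: raise so everything downstream never runs.
    if x < 0:
        raise RuntimeError("skip downstream work")
    return x * 2

@delayed
def downstream(y):
    return y + 1

try:
    downstream(maybe_skip(-1)).compute()       # downstream never executes
except RuntimeError as exc:
    print(exc)

# Alternative without raising: return a sentinel and let the dependent decide.
@delayed
def maybe_skip_v2(x):
    return None if x < 0 else x * 2

@delayed
def downstream_v2(y):
    return None if y is None else y + 1

print(downstream_v2(maybe_skip_v2(-1)).compute())   # prints None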

0 reactions
fjetter commented, Jul 22, 2021

For reference, the above KeyError will be fixed in https://github.com/dask/distributed/pull/5103 and was reported in https://github.com/dask/distributed/issues/5078 as a source of a deadlock.
