KeyError: ('fetch', 'memory') in Dask 2021.7.2 (execution sometimes hangs)


What happened: When running the code below, Dask reports the following error:

distributed.utils - ERROR - ('fetch', 'memory')
Traceback (most recent call last):
  File "C:\Users\--\Anaconda3\envs\work2\lib\site-packages\distributed\utils.py", line 638, in log_errors
    yield
  File "C:\Users\--\Anaconda3\envs\work2\lib\site-packages\distributed\worker.py", line 2435, in gather_dep
    self.transition(ts, "memory", value=data[d])
  File "C:\Users\--\Anaconda3\envs\work2\lib\site-packages\distributed\worker.py", line 1716, in transition
    func = self._transitions[start, finish]
KeyError: ('fetch', 'memory')
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x0000024984112040>>, <Task finished name='Task-24082' coro=<Worker.gather_dep() done, defined at C:\Users\--\Anaconda3\envs\work2\lib\site-packages\distributed\worker.py:2291> exception=KeyError(('fetch', 'memory'))>)
Traceback (most recent call last):
  File "C:\Users\--\Anaconda3\envs\work2\lib\site-packages\tornado\ioloop.py", line 741, in _run_callback
    ret = callback()
  File "C:\Users\--\Anaconda3\envs\work2\lib\site-packages\tornado\ioloop.py", line 765, in _discard_future_result
    future.result()
  File "C:\Users\--\Anaconda3\envs\work2\lib\site-packages\distributed\worker.py", line 2435, in gather_dep
    self.transition(ts, "memory", value=data[d])
  File "C:\Users\--\Anaconda3\envs\work2\lib\site-packages\distributed\worker.py", line 1716, in transition
    func = self._transitions[start, finish]
KeyError: ('fetch', 'memory')

Sometimes the execution hangs with one task left, and sometimes it finishes. Changing the number of workers seems to affect the outcome (i.e., whether the code finishes or hangs).
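
For illustration, a minimal sketch of varying the worker layout when creating the local client (the counts below are placeholder values, not the ones used in this report):

from dask.distributed import Client

# Start a local cluster with an explicit worker/thread layout.
# n_workers, threads_per_worker and memory_limit are illustrative placeholders.
client = Client(n_workers=4, threads_per_worker=2, memory_limit="4GB")
print(client)  # shows the scheduler address and the workers that came up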

In addition to the above error, I also get the following errors:

Traceback (most recent call last):
  File "C:\Users\--\Anaconda3\envs\work2\lib\asyncio\tasks.py", line 465, in wait_for
    fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\--\Anaconda3\envs\work2\lib\site-packages\distributed\comm\core.py", line 322, in connect
    await asyncio.wait_for(comm.write(local_info), time_left())
  File "C:\Users\--\Anaconda3\envs\work2\lib\asyncio\tasks.py", line 467, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\--\Anaconda3\envs\work2\lib\site-packages\distributed\worker.py", line 2341, in gather_dep
    response = await get_data_from_worker(
  File "C:\Users\--\Anaconda3\envs\work2\lib\site-packages\distributed\worker.py", line 3674, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "C:\Users\--\Anaconda3\envs\work2\lib\site-packages\distributed\utils_comm.py", line 385, in retry_operation
    return await retry(
  File "C:\Users\--\Anaconda3\envs\work2\lib\site-packages\distributed\utils_comm.py", line 370, in retry
    return await coro()
  File "C:\Users\--\Anaconda3\envs\work2\lib\site-packages\distributed\worker.py", line 3651, in _get_data
    comm = await rpc.connect(worker)
  File "C:\Users\--\Anaconda3\envs\work2\lib\site-packages\distributed\core.py", line 1051, in connect
    raise exc
  File "C:\Users\--\Anaconda3\envs\work2\lib\site-packages\distributed\core.py", line 1035, in connect
    comm = await fut
  File "C:\Users\--\Anaconda3\envs\work2\lib\site-packages\distributed\comm\core.py", line 326, in connect
    raise OSError(
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:49177 after 30 s
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:49177
Traceback (most recent call last):
  File "C:\Users\--\Anaconda3\envs\work2\lib\asyncio\tasks.py", line 465, in wait_for
    fut.result()
asyncio.exceptions.CancelledError

distributed.utils_perf - WARNING - full garbage collections took 23% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 23% CPU time recently (threshold: 10%)
distributed.worker - WARNING - gc.collect() took 9.391s. This is usually a sign that some tasks handle too many Python objects at the same time. Rechunking the work into smaller tasks might help.
distributed.utils_perf - WARNING - full garbage collections took 20% CPU time recently (threshold: 10%)

EDIT: Added some additional errors that happen when running the MCVE code on fake data.

What you expected to happen: I did not expect any errors to be reported.

Minimal Complete Verifiable Example:

from dask.distributed import Client
import dask.dataframe as dd
import numpy as np
import pandas as pd

Nrows = 195626892

# create fake data
df = pd.DataFrame(
  {'Date':np.random.choice(pd.date_range('2020-01-01', '2021-07-16'), Nrows),
   'Value':100*np.random.uniform(size=Nrows).astype('f8')},
  index=np.random.randint(1000128, 236822346, size=Nrows).astype('u8')
  )
df.index.name = 'Index'

client = Client()

# run Dask
ddf = dd.from_pandas(df, npartitions=53)
ddf = ddf.groupby(['Index', 'Date'])['Value'].count(split_out=8)
ddf = ddf.persist()

EDIT: Added MCVE code that will hopefully reproduce the problem.
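
As a sketch of the rechunking suggestion from the worker warning above (not a confirmed fix for this error), the same groupby can be run with more, smaller partitions; npartitions=200 is an arbitrary illustrative value:

# Rechunk into more, smaller partitions so each task handles fewer
# Python objects (per the gc.collect() warning); 200 is illustrative.
ddf = dd.from_pandas(df, npartitions=200)
ddf = ddf.groupby(['Index', 'Date'])['Value'].count(split_out=8)
ddf = ddf.persist()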

Anything else we need to know?: In #4721, a user reported that after upgrading to 2021.07.01 they started seeing the same error.

Environment:

  • Dask version: 2021.7.2
  • Python version: 3.8.10
  • Operating System: Windows Server 2016 64-bit
  • Install method (conda, pip, source): conda
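
As a generic sanity check (not something reported in the issue itself), the client can confirm that it, the scheduler, and the workers all run matching dask/distributed versions:

# Sketch: check=True flags mismatched package versions across the
# client, scheduler, and workers.
versions = client.get_versions(check=True)
print(versions)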

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

2 reactions
pentschev commented, Aug 5, 2021

FYI, I’ve also been seeing this issue intermittently in CuPy-backed workflows; I’ll make sure to test #5157.

1 reaction
pentschev commented, Aug 6, 2021

So far, I haven’t seen this issue again after #5157; it seems to have been resolved in my case. I’ll report back should I experience it again. Thanks @fjetter for the quick fix!
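
A quick way to confirm which distributed release an environment is actually running, assuming the fix from #5157 shipped in a release newer than 2021.7.2:

# Sketch: print the installed distributed version to confirm an upgrade
# picked up the fix referenced above (assumed to be in a later release).
import distributed
print(distributed.__version__)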

