KeyError: ('fetch', 'memory') in Dask 2021.7.2 (execution sometimes hangs)
What happened: When running the code below, Dask reports the following error:
distributed.utils - ERROR - ('fetch', 'memory')
Traceback (most recent call last):
File "C:\Users\--\Anaconda3\envs\work2\lib\site-packages\distributed\utils.py", line 638, in log_errors
yield
File "C:\Users\--\Anaconda3\envs\work2\lib\site-packages\distributed\worker.py", line 2435, in gather_dep
self.transition(ts, "memory", value=data[d])
File "C:\Users\--\Anaconda3\envs\work2\lib\site-packages\distributed\worker.py", line 1716, in transition
func = self._transitions[start, finish]
KeyError: ('fetch', 'memory')
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x0000024984112040>>, <Task finished name='Task-24082' coro=<Worker.gather_dep() done, defined at C:\Users\--\Anaconda3\envs\work2\lib\site-packages\distributed\worker.py:2291> exception=KeyError(('fetch', 'memory'))>)
Traceback (most recent call last):
File "C:\Users\--\Anaconda3\envs\work2\lib\site-packages\tornado\ioloop.py", line 741, in _run_callback
ret = callback()
File "C:\Users\--\Anaconda3\envs\work2\lib\site-packages\tornado\ioloop.py", line 765, in _discard_future_result
future.result()
File "C:\Users\--\Anaconda3\envs\work2\lib\site-packages\distributed\worker.py", line 2435, in gather_dep
self.transition(ts, "memory", value=data[d])
File "C:\Users\--\Anaconda3\envs\work2\lib\site-packages\distributed\worker.py", line 1716, in transition
func = self._transitions[start, finish]
KeyError: ('fetch', 'memory')
Sometimes the execution hangs with one task left, and sometimes it finishes. Changing the number of workers seems to affect the outcome (i.e., whether the code finishes or hangs).
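As a minimal sketch of what "changing the number of workers" means in practice (the specific counts below are illustrative only, not taken from my actual runs):
from dask.distributed import Client

# Illustrative worker layout only; varying n_workers/threads_per_worker is what
# seems to change whether the run finishes or hangs.
client = Client(n_workers=4, threads_per_worker=2)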
In addition to the above error, I also get the following errors:
Traceback (most recent call last):
File "C:\Users\--\Anaconda3\envs\work2\lib\asyncio\tasks.py", line 465, in wait_for
fut.result()
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Users\--\Anaconda3\envs\work2\lib\site-packages\distributed\comm\core.py", line 322, in connect
await asyncio.wait_for(comm.write(local_info), time_left())
File "C:\Users\--\Anaconda3\envs\work2\lib\asyncio\tasks.py", line 467, in wait_for
raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Users\--\Anaconda3\envs\work2\lib\site-packages\distributed\worker.py", line 2341, in gather_dep
response = await get_data_from_worker(
File "C:\Users\--\Anaconda3\envs\work2\lib\site-packages\distributed\worker.py", line 3674, in get_data_from_worker
return await retry_operation(_get_data, operation="get_data_from_worker")
File "C:\Users\--\Anaconda3\envs\work2\lib\site-packages\distributed\utils_comm.py", line 385, in retry_operation
return await retry(
File "C:\Users\--\Anaconda3\envs\work2\lib\site-packages\distributed\utils_comm.py", line 370, in retry
return await coro()
File "C:\Users\--\Anaconda3\envs\work2\lib\site-packages\distributed\worker.py", line 3651, in _get_data
comm = await rpc.connect(worker)
File "C:\Users\--\Anaconda3\envs\work2\lib\site-packages\distributed\core.py", line 1051, in connect
raise exc
File "C:\Users\--\Anaconda3\envs\work2\lib\site-packages\distributed\core.py", line 1035, in connect
comm = await fut
File "C:\Users\--\Anaconda3\envs\work2\lib\site-packages\distributed\comm\core.py", line 326, in connect
raise OSError(
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:49177 after 30 s
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:49177
Traceback (most recent call last):
File "C:\Users\--\Anaconda3\envs\work2\lib\asyncio\tasks.py", line 465, in wait_for
fut.result()
asyncio.exceptions.CancelledError
distributed.utils_perf - WARNING - full garbage collections took 23% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 23% CPU time recently (threshold: 10%)
distributed.worker - WARNING - gc.collect() took 9.391s. This is usually a sign that some tasks handle too many Python objects at the same time. Rechunking the work into smaller tasks might help.
distributed.utils_perf - WARNING - full garbage collections took 20% CPU time recently (threshold: 10%)
EDIT - Added some additional errors that happen when running the MCVE code on fake data.
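A hedged mitigation sketch, not part of the original runs: the handshake timeout and the gc.collect() warning above suggest two knobs worth trying, a longer comm connect timeout and smaller partitions. Both values below are illustrative, and ddf refers to the dask dataframe in the MCVE further down.
import dask

# Allow more time for worker-to-worker handshakes than the 30 s seen in the error above.
dask.config.set({"distributed.comm.timeouts.connect": "60s"})

# Split the work into smaller tasks, as the gc warning suggests; the partition
# count here is only an example.
ddf = ddf.repartition(npartitions=200)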
What you expected to happen: I did not expect any such errors to be reported.
Minimal Complete Verifiable Example:
from dask.distributed import Client
import dask.dataframe as dd
import numpy as np
import pandas as pd
Nrows = 195626892
# create fake data
df = pd.DataFrame(
    {'Date': np.random.choice(pd.date_range('2020-01-01', '2021-07-16'), Nrows),
     'Value': 100 * np.random.uniform(size=Nrows).astype('f8')},
    index=np.random.randint(1000128, 236822346, size=Nrows).astype('u8')
)
df.index.name = 'Index'
client = Client()
# run Dask
ddf = dd.from_pandas(df, npartitions=53)
ddf = ddf.groupby(['Index', 'Date'])['Value'].count(split_out=8)
ddf = ddf.persist()
EDIT - Added MCVE code that will hopefully reproduce the problem.
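For completeness, a sketch of how completion of the persisted result can be checked (distributed.wait and the client dashboard link are standard features; this snippet is for monitoring only and is not part of the failing workload):
from dask.distributed import wait

print(client.dashboard_link)  # watch task progress while the groupby runs
wait(ddf)                     # returns when the persisted partitions finish, or blocks if the run hangs
print(ddf.npartitions)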
Anything else we need to know?: In #4721, a user reported that after upgrading to 2021.07.01 they started seeing the same error.
Environment:
- Dask version: 2021.7.2
- Python version: 3.8.10
- Operating System: Windows Server 2016 64-bit
- Install method (conda, pip, source): conda
FYI, I’ve also been seeing this issue intermittently on CuPy-backed workflows; I’ll make sure to test #5157.
So far, I haven’t seen this issue anymore after #5157; it seems to have been resolved in my case. I’ll report back should I experience it again. Thanks @fjetter for the quick fix!