fire_and_forget scheduler error in 2.13.0
Hello, friends!
I’ve been able to reproduce a problem originally described in #3551 (now closed with a partial fix) and #3465. fire_and_forget still seems to raise an exception, and the simple example below should reproduce it.
Run a 1-worker cluster:
docker run -it --network host daskdev/dask:2.11.0 dask-scheduler
docker run -it --network host daskdev/dask:2.11.0 dask-worker localhost:8786
Run a python shell:
virtualenv .venv
source .venv/bin/activate
pip install dask distributed
python
Here are the libraries now in use:
$ pip freeze
click==7.1.1
cloudpickle==1.3.0
dask==2.13.0
distributed==2.13.0
HeapDict==1.0.1
msgpack==1.0.0
pkg-resources==0.0.0
psutil==5.7.0
PyYAML==5.3.1
sortedcontainers==2.1.0
tblib==1.6.0
toolz==0.10.0
tornado==6.0.4
zict==2.0.0
Before running the example code, docker stats
will show something similar to:
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
66fc4b11267c serene_nash 1.12% 68.76MiB / 62.91GiB 0.11% 0B / 0B 0B / 0B 3
22c2033232aa nice_spence 1.90% 107.4MiB / 62.91GiB 0.17% 0B / 0B 0B / 0B 12
[NOTE: "serene_nash" is the scheduler.]
The following small example will create a DAG and send it via fire_and_forget():
import time
import random
import dask.distributed
import logging


def inc(x):
    logging.warning(f'inc: {x}')
    return x + 1


def repro():
    prefix = random.random()

    def make_key(i):
        return f'{prefix}_{i}'  # defeat result caching in dask

    nodes = 10000
    dsk = {make_key(k): (inc, k) for k in range(nodes)}
    dsk['result1'] = (sum, [make_key(k) for k in range(0, nodes, 2)])
    dsk['result2'] = (sum, [make_key(k) for k in range(1, nodes, 2)])
    dsk['final_result'] = (sum, ['result1', 'result2'])

    with dask.config.set({"distributed.comm.compression": "lz4"}):
        client = dask.distributed.Client("tcp://localhost:8786")
        f = client.get(dsk, 'final_result', sync=False)
        dask.distributed.fire_and_forget(f)


repro()
Once this example has run, two things will have happened: the scheduler will have thrown an exception (“Error transitioning ‘final_result’ from ‘processing’ to ‘memory’”), and the amount of memory held by the scheduler process will have gone up. The exception from the scheduler:
distributed.scheduler - ERROR - Error transitioning 'final_result' from 'processing' to 'memory'
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/distributed/scheduler.py", line 4655, in transition
ts.prefix.groups.remove(tg)
ValueError: list.remove(x): x not in list
distributed.core - ERROR - list.remove(x): x not in list
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/distributed/core.py", line 473, in handle_stream
handler(**merge(extra, msg))
File "/opt/conda/lib/python3.7/site-packages/distributed/scheduler.py", line 2597, in handle_task_finished
r = self.stimulus_task_finished(key=key, worker=worker, **msg)
File "/opt/conda/lib/python3.7/site-packages/distributed/scheduler.py", line 2015, in stimulus_task_finished
recommendations = self.transition(key, "memory", worker=worker, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/distributed/scheduler.py", line 4655, in transition
ts.prefix.groups.remove(tg)
ValueError: list.remove(x): x not in list
distributed.scheduler - INFO - Remove worker <Worker 'tcp://127.0.0.1:33239', name: tcp://127.0.0.1:33239, memory: 0, processing: 0>
distributed.core - INFO - Removing comms to tcp://127.0.0.1:33239
distributed.scheduler - INFO - Lost all workers
distributed.utils - ERROR - list.remove(x): x not in list
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/distributed/utils.py", line 665, in log_errors
yield
File "/opt/conda/lib/python3.7/site-packages/distributed/scheduler.py", line 1739, in add_worker
await self.handle_worker(comm=comm, worker=address)
File "/opt/conda/lib/python3.7/site-packages/distributed/scheduler.py", line 2694, in handle_worker
await self.handle_stream(comm=comm, extra={"worker": worker})
File "/opt/conda/lib/python3.7/site-packages/distributed/core.py", line 473, in handle_stream
handler(**merge(extra, msg))
File "/opt/conda/lib/python3.7/site-packages/distributed/scheduler.py", line 2597, in handle_task_finished
r = self.stimulus_task_finished(key=key, worker=worker, **msg)
File "/opt/conda/lib/python3.7/site-packages/distributed/scheduler.py", line 2015, in stimulus_task_finished
recommendations = self.transition(key, "memory", worker=worker, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/distributed/scheduler.py", line 4655, in transition
ts.prefix.groups.remove(tg)
ValueError: list.remove(x): x not in list
distributed.core - ERROR - list.remove(x): x not in list
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/distributed/core.py", line 412, in handle_comm
result = await result
File "/opt/conda/lib/python3.7/site-packages/distributed/scheduler.py", line 1739, in add_worker
await self.handle_worker(comm=comm, worker=address)
File "/opt/conda/lib/python3.7/site-packages/distributed/scheduler.py", line 2694, in handle_worker
await self.handle_stream(comm=comm, extra={"worker": worker})
File "/opt/conda/lib/python3.7/site-packages/distributed/core.py", line 473, in handle_stream
handler(**merge(extra, msg))
File "/opt/conda/lib/python3.7/site-packages/distributed/scheduler.py", line 2597, in handle_task_finished
r = self.stimulus_task_finished(key=key, worker=worker, **msg)
File "/opt/conda/lib/python3.7/site-packages/distributed/scheduler.py", line 2015, in stimulus_task_finished
recommendations = self.transition(key, "memory", worker=worker, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/distributed/scheduler.py", line 4655, in transition
ts.prefix.groups.remove(tg)
ValueError: list.remove(x): x not in list
distributed.scheduler - INFO - Register worker <Worker 'tcp://127.0.0.1:33239', name: tcp://127.0.0.1:33239, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:33239
distributed.core - INFO - Starting established connection
And here is the state of the scheduler memory:
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
66fc4b11267c serene_nash 1.06% 161.6MiB / 62.91GiB 0.25% 0B / 0B 0B / 0B 4
22c2033232aa nice_spence 2.24% 160.4MiB / 62.91GiB 0.25% 0B / 0B 0B / 0B 18
[NOTE: "serene_nash" is the scheduler.]
Continuing to run the example code over and over will eventually cause the scheduler process to hit a memory limit and be killed. I plan to open a separate ticket about the memory issue, but I think the exception above is still closely related to the recently closed #3551.
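For completeness, here is a minimal sketch of the loop I use to repeat the reproduction. The iteration count and sleep interval are arbitrary, and it assumes repro() and the imports from the snippet above are already defined in the same session:

for attempt in range(20):
    repro()  # submit a fresh graph via fire_and_forget on each pass
    logging.warning(f'submitted run {attempt}')
    time.sleep(30)  # give the cluster time to work through the graph

Watching docker stats in another terminal while this runs should show the scheduler’s memory climbing after each pass, as described above.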
Thanks as always!
cc @mgh35
Top GitHub Comments
Hi, thanks for following up. I have not had a chance yet, but I hope to dig into this within a few days.
I have now debugged this properly and determined that your original hunch was correct: this was user error. During processing, the scheduler retains statistics for each task group in task_groups (as well as in task_prefixes). However, if the keys are not formatted in the specific way dask interprets them, task_prefixes ends up tracking the processing information (“memory”, “processing”, “released”, “waiting”) per key instead of per group. I was using keys of the form “mygroup_myidentifier_myhash”, and there were 200k nodes in the graph. After processing completed, task_prefixes was huge. Changing to the proper form “mygroup-myidentifier_myhash” makes the problem go away.
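To make the key-naming point concrete, here is a rough approximation of how a key is reduced to a prefix for grouping. This is not dask’s actual implementation (the real logic is in key_split and handles tuples, byte keys, hex hashes, and more); it only illustrates the effect of the dash:

def approx_prefix(key):
    # Very simplified stand-in for dask's grouping: keep only the part of the
    # key before the first dash. A key with no dash is its own prefix.
    return key.split('-')[0]

approx_prefix('mygroup_myidentifier_myhash')  # -> 'mygroup_myidentifier_myhash' (one prefix per key)
approx_prefix('mygroup-myidentifier_myhash')  # -> 'mygroup' (all keys share one prefix)

With the underscore-only form, every one of the 200k keys becomes its own prefix, which explains why task_prefixes grew so large; with a dash after the group name, they all collapse into a single prefix.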