fire_and_forget scheduler error in 2.13.0

Hello, friends!

I’ve been able to reproduce a problem originally described in #3551 (now closed with a partial fix) and #3465. fire_and_forget still seems to raise an exception, and the simple example below should reproduce it.

Run a 1-worker cluster:

docker run -it --network host daskdev/dask:2.11.0 dask-scheduler
docker run -it --network host daskdev/dask:2.11.0 dask-worker localhost:8786

Run a python shell:

virtualenv .venv
source .venv/bin/activate

pip install dask distributed
python

Here are the libraries installed in the client environment:

$ pip freeze
click==7.1.1
cloudpickle==1.3.0
dask==2.13.0
distributed==2.13.0
HeapDict==1.0.1
msgpack==1.0.0
pkg-resources==0.0.0
psutil==5.7.0
PyYAML==5.3.1
sortedcontainers==2.1.0
tblib==1.6.0
toolz==0.10.0
tornado==6.0.4
zict==2.0.0

Before running the example code, docker stats will show something similar to:

CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
66fc4b11267c        serene_nash         1.12%               68.76MiB / 62.91GiB   0.11%               0B / 0B             0B / 0B             3
22c2033232aa        nice_spence         1.90%               107.4MiB / 62.91GiB   0.17%               0B / 0B             0B / 0B             12

[NOTE: "serene_nash" is the scheduler.]

The following small example builds a task graph (DAG) and submits it via fire_and_forget():

import time
import random
import dask.distributed
import logging

def inc(x):
    logging.warning(f'inc: {x}')
    return x + 1

def repro():
    prefix = random.random()
    def make_key(i):
        return f'{prefix}_{i}' # defeat result caching in dask
    nodes = 10000
    dsk = {make_key(k): (inc, k) for k in range(nodes)}
    dsk['result1'] = (sum, [make_key(k) for k in range(0, nodes, 2)])
    dsk['result2'] = (sum, [make_key(k) for k in range(1, nodes, 2)])
    dsk['final_result'] = (sum, ['result1', 'result2'])

    with dask.config.set({"distributed.comm.compression": "lz4"}):
        client = dask.distributed.Client("tcp://localhost:8786")
        f = client.get(dsk, 'final_result', sync=False)
        dask.distributed.fire_and_forget(f)

repro()

Once this example has run, two things will have happened: the scheduler will have thrown an exception (“Error transitioning ‘final_result’ from ‘processing’ to ‘memory’”) and the amount of memory held by the scheduler process will have gone up. The exception from the scheduler:


distributed.scheduler - ERROR - Error transitioning 'final_result' from 'processing' to 'memory'
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/distributed/scheduler.py", line 4655, in transition
    ts.prefix.groups.remove(tg)
ValueError: list.remove(x): x not in list
distributed.core - ERROR - list.remove(x): x not in list
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/distributed/core.py", line 473, in handle_stream
    handler(**merge(extra, msg))
  File "/opt/conda/lib/python3.7/site-packages/distributed/scheduler.py", line 2597, in handle_task_finished
    r = self.stimulus_task_finished(key=key, worker=worker, **msg)
  File "/opt/conda/lib/python3.7/site-packages/distributed/scheduler.py", line 2015, in stimulus_task_finished
    recommendations = self.transition(key, "memory", worker=worker, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/distributed/scheduler.py", line 4655, in transition
    ts.prefix.groups.remove(tg)
ValueError: list.remove(x): x not in list
distributed.scheduler - INFO - Remove worker <Worker 'tcp://127.0.0.1:33239', name: tcp://127.0.0.1:33239, memory: 0, processing: 0>
distributed.core - INFO - Removing comms to tcp://127.0.0.1:33239
distributed.scheduler - INFO - Lost all workers
distributed.utils - ERROR - list.remove(x): x not in list
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/distributed/utils.py", line 665, in log_errors
    yield
  File "/opt/conda/lib/python3.7/site-packages/distributed/scheduler.py", line 1739, in add_worker
    await self.handle_worker(comm=comm, worker=address)
  File "/opt/conda/lib/python3.7/site-packages/distributed/scheduler.py", line 2694, in handle_worker
    await self.handle_stream(comm=comm, extra={"worker": worker})
  File "/opt/conda/lib/python3.7/site-packages/distributed/core.py", line 473, in handle_stream
    handler(**merge(extra, msg))
  File "/opt/conda/lib/python3.7/site-packages/distributed/scheduler.py", line 2597, in handle_task_finished
    r = self.stimulus_task_finished(key=key, worker=worker, **msg)
  File "/opt/conda/lib/python3.7/site-packages/distributed/scheduler.py", line 2015, in stimulus_task_finished
    recommendations = self.transition(key, "memory", worker=worker, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/distributed/scheduler.py", line 4655, in transition
    ts.prefix.groups.remove(tg)
ValueError: list.remove(x): x not in list
distributed.core - ERROR - list.remove(x): x not in list
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/distributed/core.py", line 412, in handle_comm
    result = await result
  File "/opt/conda/lib/python3.7/site-packages/distributed/scheduler.py", line 1739, in add_worker
    await self.handle_worker(comm=comm, worker=address)
  File "/opt/conda/lib/python3.7/site-packages/distributed/scheduler.py", line 2694, in handle_worker
    await self.handle_stream(comm=comm, extra={"worker": worker})
  File "/opt/conda/lib/python3.7/site-packages/distributed/core.py", line 473, in handle_stream
    handler(**merge(extra, msg))
  File "/opt/conda/lib/python3.7/site-packages/distributed/scheduler.py", line 2597, in handle_task_finished
    r = self.stimulus_task_finished(key=key, worker=worker, **msg)
  File "/opt/conda/lib/python3.7/site-packages/distributed/scheduler.py", line 2015, in stimulus_task_finished
    recommendations = self.transition(key, "memory", worker=worker, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/distributed/scheduler.py", line 4655, in transition
    ts.prefix.groups.remove(tg)
ValueError: list.remove(x): x not in list
distributed.scheduler - INFO - Register worker <Worker 'tcp://127.0.0.1:33239', name: tcp://127.0.0.1:33239, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:33239
distributed.core - INFO - Starting established connection

And here is the scheduler memory after the run:

CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
66fc4b11267c        serene_nash         1.06%               161.6MiB / 62.91GiB   0.25%               0B / 0B             0B / 0B             4
22c2033232aa        nice_spence         2.24%               160.4MiB / 62.91GiB   0.25%               0B / 0B             0B / 0B             18

[NOTE: "serene_nash" is the scheduler.]

Running the example code over and over will eventually cause the scheduler process to hit its memory limit and be killed. I plan to open a separate ticket about the memory issue, but I think the exception above still qualifies as a problem similar to the recently closed #3551.
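
For anyone who wants to quantify that growth from the client side rather than from docker stats alone, a minimal sketch follows. It assumes the scheduler exposes tasks, task_prefixes, and task_groups mappings (the follow-up comment further down refers to the latter two) and uses Client.run_on_scheduler, which passes the live scheduler object via the dask_scheduler keyword:

import dask.distributed

def scheduler_state_sizes(dask_scheduler=None):
    # run_on_scheduler injects the live Scheduler object via this keyword.
    # Assumption: tasks / task_prefixes / task_groups exist as mappings on
    # the scheduler in this version; names may differ in other releases.
    return {
        "tasks": len(dask_scheduler.tasks),
        "task_prefixes": len(dask_scheduler.task_prefixes),
        "task_groups": len(dask_scheduler.task_groups),
    }

client = dask.distributed.Client("tcp://localhost:8786")
print(client.run_on_scheduler(scheduler_state_sizes))

Running this before and after a few calls to repro() should show whether those structures keep growing alongside the container's reported memory.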

Thanks as always!

cc @mgh35

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 10 (5 by maintainers)

Top GitHub Comments

1 reaction
dbactual commented, Apr 14, 2020

Hi, thanks for following up. I have not had a chance yet, but I hope to dig into this within a few days.

0 reactions
dbactual commented, May 1, 2020

I have now debugged this properly and determined that your original hunch was correct: this was user error. During processing, the scheduler retains statistics about each task group in task_groups (as well as in task_prefixes). However, if the keys are not formatted the way dask expects, task_prefixes ends up tracking the state counts (“memory”, “processing”, “released”, “waiting”) per key instead of per group. I was using keys of the form “mygroup_myidentifier_myhash”, and there were 200k nodes in the graph, so by the time processing completed task_prefixes was huge. Changing the keys to the proper form “mygroup-myidentifier_myhash”, where the hyphen separates the group prefix from the rest of the key, makes the problem go away.
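
To make the key-naming convention concrete, here is a small sketch using dask's key_split helper, which derives a task prefix from a key. The import path is an assumption (recent releases expose it from dask.utils; older versions kept it elsewhere, e.g. distributed.utils), and the example keys are just the illustrative names from the comment above:

# Sketch of how dask derives a task prefix from a key.
# Assumption: key_split is importable from dask.utils in your version;
# adjust the import if your release exposes it from a different module.
from dask.utils import key_split

# Underscore-only keys: each key becomes its own prefix, so a 200k-node
# graph leaves roughly 200k entries behind in task_prefixes.
print(key_split("mygroup_myidentifier_myhash"))  # expected: 'mygroup_myidentifier_myhash'

# Hyphenated keys: roughly, the leading hyphen-separated word(s) form the
# prefix, so both of these keys are grouped under the single prefix 'mygroup'.
print(key_split("mygroup-myidentifier_myhash"))  # expected: 'mygroup'
print(key_split("mygroup-otherid_otherhash"))    # expected: 'mygroup'

Applied to the repro earlier in this issue, this would mean having make_key return something like f'inc-{prefix}_{i}' (a hypothetical change, not part of the original report) so that all 10,000 tasks share a single 'inc' prefix.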
