Task dependency bug in worker


What happened: The following code fails intermittently with a KeyError (roughly 3 runs out of 4). git bisect says the bug was introduced by https://github.com/dask/distributed/pull/4107.

distributed.worker - ERROR - "('split-simple-shuffle-b50994a1a48d47067bc463a19b005e75', 14, 55)"
Traceback (most recent call last):
  File "/home/nfs/mkristensen/repos/distributed/distributed/worker.py", line 1984, in gather_dep
    deps_ts = [self.tasks[key] for key in deps]
  File "/home/nfs/mkristensen/repos/distributed/distributed/worker.py", line 1984, in <listcomp>
    deps_ts = [self.tasks[key] for key in deps]
KeyError: "('split-simple-shuffle-b50994a1a48d47067bc463a19b005e75', 14, 55)"

What you expected to happen: For some reason the worker calls gather_dep() with a set of deps that the task does not depend on. As far as I can see, the client and scheduler maintain task dependencies correctly.

Minimal Complete Verifiable Example:


import pandas as pd
import numpy as np

import dask.dataframe as dd
from dask.dataframe.shuffle import shuffle
from distributed import wait
import dask

from distributed import Client, LocalCluster

nparts = 100
max_branch = 100
data_size = nparts * max_branch


def main(client):
    df = pd.DataFrame({"x": np.arange(data_size)})
    ddf = dd.from_pandas(df, npartitions=nparts)
    ddf = ddf.persist(optimize_graph=False)
    wait(ddf)

    with dask.config.set({"optimization.fuse.active": False}):
        s = shuffle(
            ddf, ddf.x, shuffle="tasks", npartitions=nparts, max_branch=max_branch
        )
        s = s.persist()
        wait(s)


if __name__ == "__main__":
    with LocalCluster(scheduler_port=0, asynchronous=False, n_workers=5) as cluster:
        with Client(cluster, asynchronous=False) as client:
            main(client)

cc. @gforsyth

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
gforsyth commented, Nov 10, 2020

I can confirm the issue. Here’s an inelegant fix while I figure out the correct one:

diff --git a/distributed/worker.py b/distributed/worker.py
index b9bc9b633..e77fd984a 100644
--- a/distributed/worker.py
+++ b/distributed/worker.py
@@ -1998,7 +1998,7 @@ class Worker(ServerNode):
 
                 # dep states may have changed before gather_dep runs
                 # if a dep is no longer in-flight then don't fetch it
-                deps_ts = [self.tasks[key] for key in deps]
+                deps_ts = [self.tasks.get(key, TaskState(key)) for key in deps]
                 deps_ts = tuple(ts for ts in deps_ts if ts.state == "flight")
                 deps = [d.key for d in deps_ts]
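
The effect of the patched lines can be tried in isolation. This is a minimal sketch only: `TaskState` here is a hypothetical stand-in, not the real class in distributed/worker.py, and its default `state` of `None` is an assumption. `filter_in_flight` is likewise a made-up helper name wrapping the three lines from the diff. It shows that a key missing from `tasks` no longer raises KeyError and is simply filtered out because its placeholder is not in the "flight" state.

```python
class TaskState:
    """Hypothetical stand-in for the worker's TaskState (assumption)."""

    def __init__(self, key, state=None):
        self.key = key
        self.state = state


def filter_in_flight(tasks, deps):
    """Mirror the patched lines: tolerate missing keys, keep in-flight deps."""
    # A missing key gets a throwaway TaskState instead of raising KeyError
    deps_ts = [tasks.get(key, TaskState(key)) for key in deps]
    # Only deps that are still in flight should be fetched
    deps_ts = tuple(ts for ts in deps_ts if ts.state == "flight")
    return [ts.key for ts in deps_ts]


tasks = {
    "a": TaskState("a", state="flight"),
    "b": TaskState("b", state="memory"),
}
# "c" is not tracked by this worker, like the key in the traceback
deps = ["a", "b", "c"]

result = filter_in_flight(tasks, deps)
print(result)
```

Note that the placeholder is never stored in `tasks`, so the worker's bookkeeping is left untouched by the lookup.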
 
0 reactions
gforsyth commented, Nov 12, 2020

> what is inelegant about this in particular?

I wanted to confirm that this was due to (I think) work stealing and not a request for the “wrong” dependencies. Tried to get a test that would reliably fail on this but didn’t quite get there.

> deps_ts = [self.tasks.get(key, None) or TaskState(key) for key in deps]

Yes, I think this is the right way to do this. I’ll push up the PR now.
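
One practical difference between the two variants in this thread is when the fallback object gets built. The sketch below again uses a hypothetical stand-in `TaskState` (an assumption, not the real class) that records each construction. `dict.get(key, default)` evaluates its default argument on every call, while `get(key, None) or TaskState(key)` builds the fallback only when the lookup misses. The `or` form relies on the stored objects being truthy, which holds for a plain class that defines neither `__bool__` nor `__len__`.

```python
created = []  # records every TaskState construction


class TaskState:
    """Hypothetical stand-in that logs when it is constructed (assumption)."""

    def __init__(self, key):
        self.key = key
        self.state = None
        created.append(key)


tasks = {"a": TaskState("a")}
deps = ["a", "b"]

# Variant 1: the default is constructed for every key, even when present
created.clear()
eager = [tasks.get(key, TaskState(key)) for key in deps]
eager_created = list(created)

# Variant 2: the fallback is constructed only for the missing key
created.clear()
lazy = [tasks.get(key, None) or TaskState(key) for key in deps]
lazy_created = list(created)

print("eager constructions:", eager_created)
print("lazy constructions:", lazy_created)
```

Both variants return the stored object for a key that exists, so they differ only in how many throwaway objects are created along the way.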
