Unexpected `CancelledFuture` error

Over at https://github.com/PrefectHQ/prefect/issues/943 we’re running into a very interesting situation; we believe we know how to fix it in our own codebase, but I’d like to understand whether this is a fundamentally bad Dask pattern or whether this is a bug.

In short, we are using multiple worker_clients together with fire_and_forget, and are seeing CancelledFuture errors that show up only under very specific conditions (see the notes after the reproduction).

To reproduce, start a local cluster:

dask-scheduler
dask-worker --nanny --nthreads 3 127.0.0.1:8786

and run this script (basically a distillation of how we’re using Dask in our pipeline):

from distributed import Client, fire_and_forget, worker_client

def inc(x):
    return x + 1

def dec(x):
    return x - 1

def mapper(fn, arr):
    with worker_client(separate_thread=True) as client:
        futures = client.map(fn, arr, pure=False)
    fire_and_forget(futures)
    return futures

def waiter(futs):
    with worker_client(separate_thread=True) as client:
        return client.gather(futs)

def pipeline(fn, arr):
    futs = mapper(fn, arr)
    return waiter(futs)

with Client(address='localhost:8786') as client:
    f = client.submit(pipeline, inc, list(range(100)))
    fire_and_forget(f) # this appears to be irrelevant
    g = client.submit(pipeline, dec, list(range(100)))
    fire_and_forget(g) # this appears to be irrelevant
    val = client.gather([f,g])

and you should get a CancelledError.

Notes:

  • if you write the logs from the scheduler and worker to a file, the error doesn’t seem to occur
  • if you run this twice, the error won’t occur; it seems to only occur whenever the scheduler, worker and python process are all “fresh”
  • this is on distributed.__version__ = 1.26.1
  • it appears to be threading related (running with 1 thread doesn’t reproduce)

It seems that removing the fire_and_forget call inside mapper fixes the situation.
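For reference, here is a minimal sketch of the script with that one call removed. It only restates the observation above and is not a verified fix. The `run` wrapper and the function-local imports are additions for illustration (they let the module load without a cluster); everything else mirrors the reproduction script.

```python
def inc(x):
    return x + 1


def dec(x):
    return x - 1


def mapper(fn, arr):
    # Imported inside the function (an illustrative choice) so that this
    # module can be imported without a running cluster.
    from distributed import worker_client

    with worker_client(separate_thread=True) as client:
        futures = client.map(fn, arr, pure=False)
    # No fire_and_forget(futures) here: the futures are returned directly,
    # so the caller's reference is what keeps them alive.
    return futures


def waiter(futs):
    from distributed import worker_client

    with worker_client(separate_thread=True) as client:
        return client.gather(futs)


def pipeline(fn, arr):
    futs = mapper(fn, arr)
    return waiter(futs)


def run(address="localhost:8786"):
    # Hypothetical wrapper around the original top-level block.
    from distributed import Client

    with Client(address=address) as client:
        f = client.submit(pipeline, inc, list(range(100)))
        g = client.submit(pipeline, dec, list(range(100)))
        return client.gather([f, g])
```

With the scheduler and worker from the reproduction running, calling `run()` is expected to return two lists of 100 results each instead of raising.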

cc: @jlowin

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 12 (12 by maintainers)

Top GitHub Comments

mrocklin commented, Apr 23, 2019 (1 reaction)

They are all fairly similar in how they’re implemented. Queues are probably the cleanest. It would be straightforward, I think, to add other coordination primitives to Dask, should you all be interested. These could live within the main project, or probably outside as well if you wanted to manage and evolve them yourselves.
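As a rough illustration of the queue-based approach mentioned here, the sketch below hands futures from one task to another through a `distributed.Queue` instead of returning them directly. This is an assumption about how such a setup could look, not the pattern Prefect or Dask settled on. The names `producer`, `consumer`, `run`, and the queue name `"pipeline-futures"` are hypothetical, and the code assumes that a `Queue` created inside a `worker_client` block attaches to that worker's client.

```python
def inc(x):
    return x + 1


def producer(fn, arr, queue_name):
    # Imports are function-local (an illustrative choice) so the module
    # loads without a running cluster.
    from distributed import Queue, worker_client

    with worker_client() as client:
        # Assumption: with no explicit client, the Queue picks up the
        # current worker client.
        queue = Queue(queue_name)
        for fut in client.map(fn, arr, pure=False):
            # Putting a future on the queue registers it with the
            # scheduler, which keeps it alive until it is taken off.
            queue.put(fut)
    return len(arr)


def consumer(queue_name, n):
    from distributed import Queue, worker_client

    with worker_client() as client:
        queue = Queue(queue_name)
        # Each get() blocks until the producer has put another future.
        futs = [queue.get() for _ in range(n)]
        return client.gather(futs)


def run(address="localhost:8786", queue_name="pipeline-futures"):
    from distributed import Client

    data = list(range(100))
    with Client(address=address) as client:
        p = client.submit(producer, inc, data, queue_name)
        c = client.submit(consumer, queue_name, len(data))
        _, results = client.gather([p, c])
        return results
```

The design point is that the queue, rather than a Python reference passed between tasks, is what holds the futures while they change hands, so no `fire_and_forget` call is involved.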

cicdw commented, Apr 24, 2019 (0 reactions)

Fair; the only thing that still leaves me suspicious is the fact that this always seems to run fine on the second run, but not always on the first run. Not sure if there’s a bug in there, or if that’s just coincidence.

I think a full description of what client / future-sharing setup we’re trying to achieve is beyond the scope of this issue, but I’d be happy to share it if anyone is interested in digging into the Prefect code base 😃
