Unexpected `CancelledFuture` error
Over at https://github.com/PrefectHQ/prefect/issues/943 we’re running into a very interesting situation; we believe we know how to fix it in our own codebase, but I’d like to understand whether this is a fundamentally bad Dask pattern or whether this is a bug.
In short, we are using multiple worker_clients together with fire_and_forget, and are seeing CancelledFuture errors in circumstances where we would not expect them; the errors only appear under very specific conditions (see the notes below).
To reproduce, start a local cluster:
dask-scheduler
dask-worker --nanny --nthreads 3 127.0.0.1:8786
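Equivalently, the same topology can be started in-process; this is a hypothetical alternative for convenience (the port and thread count below simply mirror the CLI commands above):

from distributed import Client, LocalCluster

# In-process stand-in for the dask-scheduler / dask-worker commands above:
# one nannied worker with 3 threads, with the scheduler on the default port 8786.
cluster = LocalCluster(n_workers=1, threads_per_worker=3, scheduler_port=8786)
client = Client(cluster)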
and run this script (basically a distillation of how we’re using Dask in our pipeline):
from distributed import Client, fire_and_forget, worker_client

def inc(x):
    return x + 1

def dec(x):
    return x - 1

def mapper(fn, arr):
    with worker_client(separate_thread=True) as client:
        futures = client.map(fn, arr, pure=False)
        fire_and_forget(futures)
        return futures

def waiter(futs):
    with worker_client(separate_thread=True) as client:
        return client.gather(futs)

def pipeline(fn, arr):
    futs = mapper(fn, arr)
    return waiter(futs)

with Client(address='localhost:8786') as client:
    f = client.submit(pipeline, inc, list(range(100)))
    fire_and_forget(f)  # this appears to be irrelevant
    g = client.submit(pipeline, dec, list(range(100)))
    fire_and_forget(g)  # this appears to be irrelevant
    val = client.gather([f, g])
and you should get a CancelledError.
Notes:
- if you write the logs from the scheduler and worker to a file, the error doesn’t seem to occur
- if you run this twice, the error won’t occur; it seems to only occur when the scheduler, worker, and Python process are all “fresh”
- this is on distributed.__version__ = 1.26.1
- it appears to be threading related (running with 1 thread doesn’t reproduce)
It seems that removing the fire_and_forget call inside mapper fixes the situation.
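For reference, the workaround amounts to dropping the inner fire_and_forget and relying on the returned futures being held downstream; a sketch of the modified mapper (not the exact Prefect code):

def mapper(fn, arr):
    with worker_client(separate_thread=True) as client:
        # No fire_and_forget here: pipeline/waiter keep references to these
        # futures, which is enough to keep the tasks alive until gathered.
        return client.map(fn, arr, pure=False)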
cc: @jlowin
Issue Analytics
- Created: 4 years ago
- Reactions: 1
- Comments: 12 (12 by maintainers)
Top GitHub Comments
They are all fairly similar in how they’re implemented. Queues are probably the cleanest. It would be straightforward, I think, to add other coordination primitives to Dask, should you all be interested. These could live within the main project, or probably outside as well if you wanted to manage and evolve them yourselves.
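For context, the Queue-based coordination mentioned here might look roughly like the following; the names and structure are illustrative and not taken from the Prefect code base:

from distributed import Client, Queue, worker_client

def producer(n):
    # Submit work from inside a task and publish the resulting futures
    # on a named Queue so another task can find them.
    with worker_client(separate_thread=True) as client:
        q = Queue("shared-futures", client=client)
        for fut in client.map(lambda x: x + 1, range(n), pure=False):
            q.put(fut)

def consumer(n):
    # A separate task picks the futures up by name and waits on their results.
    with worker_client(separate_thread=True) as client:
        q = Queue("shared-futures", client=client)
        return [q.get().result() for _ in range(n)]

with Client(address='localhost:8786') as client:
    f = client.submit(producer, 10)
    g = client.submit(consumer, 10)
    print(g.result())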
Fair; the only thing that still leaves me suspicious is the fact that this always seems to run fine on the second run, but not always on the first run. Not sure if there’s a bug in there, or if that’s just coincidence.
I think a full description of what client / future-sharing setup we’re trying to achieve is beyond the scope of this issue, but I’d be happy to share it if anyone is interested in digging into the Prefect code base 😃