Communicate Futures between Clients
How do we communicate futures between clients?
The recent ability of tasks to spawn other tasks with `local_client` opens the possibility for highly dynamic computations. For example, we could have workers watch live data sources and create feeds of data to which other clients respond. This is all possible today, except that there is no mechanism for clients to share futures with each other. They all use the same computation and memory resources, but they have no way to make each other aware of what is on the cluster. Several users of `local_client` have noted this limitation.
So what is the right abstraction to communicate futures between clients?
Option One: Kafka Topics
We could hold `Future` objects in a rolling buffer. This would be a data structure living on the centralized scheduler to which all clients would get a view. Appends would always happen at the end of this buffer, while every client would also have a read-head that they could use for iteration. It's worth noting that, unlike Kafka, this wouldn't hold actual data, just `Future`s; the actual data would live on the workers.
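As a rough sketch of what such a rolling buffer with per-client read-heads might look like (plain Python, with ordinary values standing in for futures; `Topic`, `maxlen`, and `reader` are illustrative names, not existing Dask API):

```python
class Topic:
    """Sketch of a rolling buffer of futures held on the scheduler.

    Real futures and scheduler communication are replaced here by
    ordinary values and a local list, just to show the mechanics.
    """

    def __init__(self, name, maxlen=1000):
        self.name = name
        self.maxlen = maxlen
        self.offset = 0   # global index of the oldest retained entry
        self.buffer = []

    def append(self, future):
        self.buffer.append(future)
        if len(self.buffer) > self.maxlen:
            # Roll off the oldest entries, Kafka-style
            drop = len(self.buffer) - self.maxlen
            self.buffer = self.buffer[drop:]
            self.offset += drop

    def reader(self):
        """An independent read-head; each client iterates at its own pace."""
        position = self.offset
        while True:
            if position < self.offset:
                # This reader fell behind the rolling window; skip ahead
                position = self.offset
            if position < self.offset + len(self.buffer):
                yield self.buffer[position - self.offset]
                position += 1
            else:
                return  # caught up (a real implementation would block/await)
```

Each call to `reader()` creates a new read-head, so two clients can consume the same topic independently, and a slow reader that falls out of the retained window simply skips forward rather than blocking producers.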
Option Two: Custom
I’ve spoken to a few people who have custom needs here, so providing a single abstraction like Kafka’s topic probably doesn’t suffice. Instead, we would want to create an interface that others could implement and upload to the scheduler. A couple of examples that have come up in the wild:
- A collection into which we would place futures sequentially, but which would only allow each future to be collected a fixed number of times
- A fixed length collection that we would mutate with new futures as new data came in. Clients watching this collection might re-run a computation on every change.
- Our current `get/publish_dataset` functionality might fall into this category
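To make the "interface that others could implement" idea concrete, here is a sketch of what such a pluggable collection might look like, using the first example above (futures collectable only a fixed number of times). All names here are hypothetical, not part of Dask:

```python
import abc
from collections import deque


class SharedCollection(abc.ABC):
    """Hypothetical interface a user would implement and upload to
    the scheduler; the method names are illustrative only."""

    @abc.abstractmethod
    def append(self, future):
        """Add a future to the collection."""

    @abc.abstractmethod
    def collect(self):
        """Retrieve a future according to this collection's policy."""


class LimitedCollect(SharedCollection):
    """Each appended future may be collected at most `limit` times."""

    def __init__(self, limit=1):
        self.limit = limit
        self.queue = deque()  # [future, remaining-collections] pairs

    def append(self, future):
        self.queue.append([future, self.limit])

    def collect(self):
        future, remaining = self.queue[0]
        remaining -= 1
        if remaining == 0:
            self.queue.popleft()  # exhausted: drop the reference
        else:
            self.queue[0][1] = remaining
        return future
```

The fixed-length "mutate and notify watchers" collection from the second example would implement the same interface with a different `append`/`collect` policy, which is the point of exposing an interface rather than a single built-in structure.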
API Play
So here is a possible API for critique:
Data Collection
```python
with local_client() as c:
    topic = c.topic('raw-data')
    for batch in watch_data_feed(...):
        future = c.scatter(batch)
        topic.append(future)
```
Process Data
We pull the raw data futures into two different worker/clients, process with a couple of functions, and then push out to a feed of processed data.
```python
with local_client() as c:
    raw = c.topic('raw-data')
    processed = c.topic('processed-data')
    for future in raw:
        future2 = c.submit(f, future)
        processed.append(future2)
```

```python
with local_client() as c:
    raw = c.topic('raw-data')
    processed = c.topic('processed-data')
    for future in raw:
        future2 = c.submit(g, future)
        processed.append(future2)
```
Collect
We get futures from the processed data topic, gather the data to the local machine, and then do something with it here such as emit some plots or analysis. We pull the results of both processing clients without caring which future was submitted by which client.
```python
with local_client() as c:
    processed = c.topic('processed-data')
    for future in processed:
        data = c.gather(future)
        ...
```
Issue Analytics
- Created 7 years ago
- Comments: 5 (5 by maintainers)
Top GitHub Comments
I think that you could accomplish what you’re asking for above with the current implementation and a tiny bit of tornado.
This was implemented in #729