A potential approach for checking if a datum is done
One way to think about checkpointing is knowing whether a piece of data has completely exited the pipeline, i.e. it is no longer being processed or being held in a cache.
Backpressure can help out here, since we can know if the pipeline has returned control up the call stack.
I think there are three ways that control is returned:
1. All computations have finished; the data has left through a sink
2. Some computations have finished; the data was dropped by a filter
3. Some computations have finished; the data was stored in a cache, waiting for something
Cases 1 and 2 are handled nicely by backpressure, since if all the computations have finished then the data is done.
Case 3 is a bit more tricky, since we need to track the cached values.
Potential implementation
def _emit(self, x, metadata={'reference': ref_counter}):
    ref_counter += len(self.downstreams)
    for downstream in self.downstreams:
        downstream.update(x, metadata=metadata)
        ref_counter -= 1
Caching nodes
def update(self, x, metadata={'reference': ref_counter}):
    self.cache.append(x)
    ref_counter += 1
When the RefCounter hits zero the data is done.
Note that this would need to merge the refcounting metadata when joining nodes. For instance, combine_latest
would need to propagate the ref counters for all incoming data sources, and all downstream data consumers would need to increment/decrement all of those ref counters.
We’d also need to create the refcounters at data ingestion time.
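As an illustration of the idea, a minimal reference counter with a completion callback might look like the following sketch. The names (RefCounter, on_done, retain, release) are hypothetical and not part of Streamz.

```python
# Hypothetical reference counter: fires a callback once the count
# returns to zero, signalling that the datum has fully exited the pipeline.
class RefCounter:
    def __init__(self, on_done=None):
        self._count = 0
        self._on_done = on_done  # called once when the count reaches zero

    @property
    def count(self):
        return self._count

    def retain(self, n=1):
        self._count += n

    def release(self, n=1):
        self._count -= n
        if self._count == 0 and self._on_done is not None:
            self._on_done()

# Usage: the counter is created at ingestion time and travels in metadata.
done = []
counter = RefCounter(on_done=lambda: done.append(True))
counter.retain(2)   # e.g. two downstream nodes receive the datum
counter.release()   # first downstream finishes
counter.release()   # second finishes -> the datum is done
```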
Issue Analytics
- Created: 4 years ago
- Comments: 35 (32 by maintainers)
Top GitHub Comments
I would like to implement this reference counter idea. As deadlines are looming on our side, I would like to make everything clear and get everyone on-board with how things will work before we start writing code.
Firstly, after thinking about this, it seems the reference counters idea is much simpler to implement and will handle the multiple-downstreams configuration much more elegantly than the previous PR, where we were using futures to know when data is done. Thanks for proposing this.
Additionally, passing metadata along with the data seems like a good idea for features in the future which are not specified at this time.
If a pipeline in Streamz were nothing but calls to map(), then we would not need to implement anything to know when data is done. We would set up the pipeline as such. Each emit call would block until the data has exited the sink_to_list() call.

However, there are nodes that cache the data, such as buffer() and combine_latest(). These nodes will return without passing the data downstream. This will unravel the call stack, and the call to emit() will return before the data exits the stream. Some of these nodes will emit the data downstream at some unspecified time in the future. This makes it hard to know when data is done being processed. To illustrate, the example above would interact with the call stack in the following way, where the stack would grow at each time interval. After t8, the stack would unravel and emit() would return in the user’s application. However, with the use of buffer(), the call stack would not continue to grow until the end of the stream. It would unravel early and cause emit() to return before data exits the stream.

In order to reconcile this issue, the proposed solution is to have a reference counter which would track how many nodes are currently holding references to a given datum. The processing on a given datum is considered to be done once the reference count reaches zero. The mapping of datum to metadata is 1:1. So, in the previous example, a reference counter could be passed to the buffer() node, and the buffer() node could increment the reference count by 1, indicating that the datum is still being processed. When buffer() is ready to continue to emit the data downstream, it will decrement the counter by 1 after calling _emit() to ensure that there is no time when the counter is zero before the data is done. This last detail is important: we do not want the reference counter to report zero references when the data is still in the pipeline. So, the reference counter would alter the previous example like so:
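As a rough sketch of this increment-before-cache / decrement-after-emit ordering (class names are hypothetical, not Streamz's actual node API), a buffer-like node might hold a reference while a datum sits in its cache and release it only after emitting:

```python
# Hypothetical buffer-like node illustrating the ordering: retain on
# cache entry, release only after the downstream emit has happened, so
# the count can never read zero while the datum is still in flight.
from collections import deque

class SimpleRefCounter:
    def __init__(self):
        self.count = 0
    def retain(self):
        self.count += 1
    def release(self):
        self.count -= 1

class BufferNode:
    def __init__(self, downstream):
        self.downstream = downstream  # any callable standing in for _emit
        self.cache = deque()

    def update(self, x, counter):
        counter.retain()              # datum is now held in this node's cache
        self.cache.append((x, counter))
        # in a real stream, the flush would happen asynchronously later

    def flush(self):
        while self.cache:
            x, counter = self.cache.popleft()
            self.downstream(x)        # emit first...
            counter.release()         # ...then drop this node's reference

# Usage
out = []
counter = SimpleRefCounter()
node = BufferNode(out.append)
node.update('x', counter)  # counter.count == 1 while 'x' is cached
node.flush()               # emits 'x', then releases the reference
```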
In another example, combine_latest() takes two streams as inputs. We can have two streams: a and b. If data is received in a, a tuple will be emitted with the input from a and the last received input from b. This means that the node will hold on to any incoming data elements, so a reference will need to be incremented for each. When the data changes, the reference counter can be decremented.

However, in this latest example, we can see that with two streams being combined into a single emitted tuple, there will need to be some way to combine metadata from all incoming streams to accompany the single emitted element. The proposed solution is to use a container for the reference counters rather than the bare reference itself. Namely, this container will be a list type. The incoming reference counters will all be emitted downstream, and the Stream class will need to have functionality to handle both the case of a bare reference counter and a container holding a collection of reference counters.

I would like to post a progress update and a bit of a breakdown of the problem.
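The list-container idea might look like the following minimal sketch; the helper names are hypothetical, not Streamz's actual API.

```python
# Sketch of merging reference-counter metadata when two streams join,
# as in combine_latest(): the emitted tuple references every upstream
# datum, so its metadata must carry all of their counters as a list.
def as_counter_list(ref):
    """Normalize metadata that may hold one counter or a list of them."""
    return list(ref) if isinstance(ref, list) else [ref]

def merge_counters(ref_a, ref_b):
    # downstream consumers then increment/decrement every counter in the list
    return as_counter_list(ref_a) + as_counter_list(ref_b)
```

Normalizing to a list at the join means downstream nodes only ever deal with one shape of metadata after a join, at the cost of handling both the bare and list cases at the boundary.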
Previously, and for the purposes here, we have categorized the two types of functions in Streamz as cache nodes and non-cache nodes. I think it would make more sense to categorize them as nodes that call _emit and do not return before it completes (synchronous nodes), and nodes that either do not call _emit or return before calling _emit, which unwinds the call stack (asynchronous nodes). The latter cause the pipeline to become asynchronous and give us a reason to track reference counts. All of the async nodes will handle the reference counters the same way: if data is cached, the counter will be incremented when the data comes into the node, and after the data leaves the node it will be decremented. Sync nodes will only pass the metadata downstream. This leaves the last problem, which is how the async nodes will handle the metadata. This is detailed below.

Synchronous nodes
- Core
- Dask

Async’ nodes (grouped by strategy)

Core

- Input to output is 1:1: Couple the metadata with the data in the cache, then pull the metadata off the cache with the data and emit them together.
- Single-value inputs, collection outputs: To increase efficiency, the metadata for the last element can be emitted while the metadata for the other elements is not. However, this could lead to problems where the metadata for every element is expected downstream: any unit interested in when the data completes will see the elements that are not at the end of the collection as completed. The less efficient alternative is to cache the metadata with the data and emit the metadata as a collection.
- Multiple stream inputs combined into a single output: Every datum that is emitted downstream will have its accompanying metadata emitted with it as a collection.
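The "emit the metadata as a collection" strategy could be sketched as follows; the names are hypothetical and this is not the actual Streamz partition implementation.

```python
# Hypothetical partition-like node: single values in, one collection out.
# Each cached datum keeps its counter retained; on emit, the data and
# their counters are sent downstream together as collections, and the
# counters are released only after the emit.
class Counter:
    def __init__(self):
        self.count = 0

class PartitionNode:
    def __init__(self, n, downstream):
        self.n = n
        self.downstream = downstream  # called as downstream(data, counters)
        self.cache = []

    def update(self, x, counter):
        counter.count += 1                   # datum is now held in the cache
        self.cache.append((x, counter))
        if len(self.cache) == self.n:
            batch, self.cache = self.cache, []
            data = tuple(d for d, _ in batch)
            counters = [c for _, c in batch]
            self.downstream(data, counters)  # emit data and metadata together
            for c in counters:
                c.count -= 1                 # release only after the emit

# Usage
out = []
node = PartitionNode(2, lambda d, m: out.append(d))
c1, c2 = Counter(), Counter()
node.update(1, c1)  # cached; c1.count == 1
node.update(2, c2)  # cache full -> emits (1, 2), releases both counters
```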
to_kafka

I’m not yet sure what to do about this node. The _emit function is called, but the use of the Confluent library for the callback makes it difficult to match the metadata to the datum. There is also the question as to whether we should make any changes here, as to_kafka should be the end of the pipeline.

Dask
A yield is performed before _emit. The metadata is a simple pass-through, just as in sync nodes, but the reference counter will need to be incremented here, as there is a yield before the _emit call.
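A minimal sketch of that ordering, using asyncio in place of Dask futures (all names here are hypothetical):

```python
# The counter is retained before the yield point so it cannot read zero
# while the computation is in flight, and released only after _emit returns.
import asyncio

class Counter:
    def __init__(self):
        self.count = 0

async def dask_like_update(x, counter, submit, _emit):
    counter.count += 1        # retain before the yield point
    result = await submit(x)  # the yield happens before _emit
    _emit(result)
    counter.count -= 1        # release only after _emit returns

# Usage
emitted = []

async def double(x):
    return 2 * x

ctr = Counter()
asyncio.run(dask_like_update(3, ctr, double, emitted.append))
```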