A potential approach for checking if a datum is done
One way to think about checkpointing is knowing whether a piece of data has completely exited the pipeline, i.e. it is no longer being processed or being held in a cache.
Backpressure can help out here, since we can know if the pipeline has returned control up the call stack.
I think there are three ways that control is returned:
1. All computations have finished; the data has left through a sink
2. Some computations have finished; the data was dropped by a filter
3. Some computations have finished; the data was stored in a cache, waiting for something
Cases 1 and 2 are handled nicely by backpressure, since if all the computations have finished then the data is done.
Case 3 is a bit more tricky, since we need to track the cached values.
Potential implementation
def _emit(self, x, metadata={'reference': ref_counter}):
    ref_counter += len(self.downstreams)
    for downstream in self.downstreams:
        downstream.update(x, metadata=metadata)
        ref_counter -= 1
Caching nodes
def update(self, x, metadata={'reference': ref_counter}):
    self.cache.append(x)
    ref_counter += 1
When the RefCounter hits zero the data is done.
Note that this would need to merge the refcounting metadata when joining nodes. For instance, combine_latest
would need to propagate the ref counters for all incoming data sources, and all downstream data consumers would need to increment/decrement all of those ref counters.
We’d also need to create the refcounters at data ingestion time.
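As an illustration of the idea, a minimal reference counter with a completion callback might look like the following sketch. The names (RefCounter, on_done, retain, release) are hypothetical and not part of Streamz.

```python
# Hypothetical reference counter: fires a callback once the count
# returns to zero, signalling that the datum has fully exited the pipeline.
class RefCounter:
    def __init__(self, on_done=None):
        self._count = 0
        self._on_done = on_done  # called once when the count reaches zero

    @property
    def count(self):
        return self._count

    def retain(self, n=1):
        self._count += n

    def release(self, n=1):
        self._count -= n
        if self._count == 0 and self._on_done is not None:
            self._on_done()

# Usage: the counter is created at ingestion time and travels in metadata.
done = []
counter = RefCounter(on_done=lambda: done.append(True))
counter.retain(2)   # e.g. two downstream nodes receive the datum
counter.release()   # first downstream finishes
counter.release()   # second finishes -> the datum is done
```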
Issue Analytics
- Created: 4 years ago
- Comments: 35 (32 by maintainers)
Top GitHub Comments
I would like to implement this reference counter idea. As deadlines are looming on our side, I would like to make everything clear and get everyone on-board with how things will work before we start writing code.
Firstly, after thinking about this, it seems the reference counters idea is much simpler to implement and will handle the multiple-downstreams configuration much more elegantly than the previous PR, where we were using futures to know when data is done. Thanks for proposing this.
Additionally, passing metadata along with the data seems like a good idea for features in the future which are not specified at this time.
If a pipeline in Streamz were nothing but calls to map(), then we would not need to implement anything to know when data is done. We would set up the pipeline as such. Each emit call would block until the data has exited the sink_to_list() call.

However, there are nodes that cache the data, such as buffer() and combine_latest(). These nodes will return without passing the data downstream. This will unravel the call stack, and the call to emit() will return before the data exits the stream. Some of these nodes will emit the data downstream at some unspecified time in the future. This makes it hard to know when data is done being processed. To illustrate, the example above would interact with the call stack in the following way, where the stack would grow at each time interval. After t8, the stack would unravel and emit() would return in the user’s application. However, with the use of buffer(), the call stack would not continue to grow until the end of the stream. It would unravel early and cause emit() to return before data exits the stream.

In order to reconcile this issue, the proposed solution is to have a reference counter which would track how many nodes are currently holding references to a given datum. The processing on a given datum is considered to be done once the reference count reaches zero. The mapping of datum to metadata is 1:1. So, in the previous example, a reference counter could be passed to the buffer() node, and the buffer() node could increment the reference count by 1, indicating that the datum is still being processed. When buffer() is ready to continue to emit the data downstream, it will decrement the counter by 1 after calling _emit() to ensure that there is no time when the counter is zero before the data is done. This last detail is important: we do not want the reference counter to report zero references when the data is still in the pipeline. So, the reference counter would alter the previous example like so:
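As a rough sketch of this increment-before-cache / decrement-after-emit ordering (class names are hypothetical, not Streamz's actual node API), a buffer-like node might hold a reference while a datum sits in its cache and release it only after emitting:

```python
# Hypothetical buffer-like node illustrating the ordering: retain on
# cache entry, release only after the downstream emit has happened, so
# the count can never read zero while the datum is still in flight.
from collections import deque

class SimpleRefCounter:
    def __init__(self):
        self.count = 0
    def retain(self):
        self.count += 1
    def release(self):
        self.count -= 1

class BufferNode:
    def __init__(self, downstream):
        self.downstream = downstream  # any callable standing in for _emit
        self.cache = deque()

    def update(self, x, counter):
        counter.retain()              # datum is now held in this node's cache
        self.cache.append((x, counter))
        # in a real stream, the flush would happen asynchronously later

    def flush(self):
        while self.cache:
            x, counter = self.cache.popleft()
            self.downstream(x)        # emit first...
            counter.release()         # ...then drop this node's reference

# Usage
out = []
counter = SimpleRefCounter()
node = BufferNode(out.append)
node.update('x', counter)  # counter.count == 1 while 'x' is cached
node.flush()               # emits 'x', then releases the reference
```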
In another example, combine_latest() takes two streams as inputs. We can have two streams: a and b. If data is received in a, a tuple will be emitted with the input from a and the last received input from b. This means that the node will hold on to any incoming data elements, so a reference will need to be incremented for each. When the data changes, the reference counter can be decremented.

However, in this latest example, we can see that with two streams being combined into a single emitted tuple, there will need to be some way to combine metadata from all incoming streams to accompany the single emitted element. The proposed solution is to use a container for the reference counters rather than the bare reference itself. Namely, this container will be a list type. The incoming reference counters will all be emitted downstream, and the Stream class will need to have functionality to handle both the case of a bare reference counter and a container holding a collection of reference counters.

I would like to post a progress update and a bit of a breakdown of the problem.
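The list-container idea might look like the following minimal sketch; the helper names are hypothetical, not Streamz's actual API.

```python
# Sketch of merging reference-counter metadata when two streams join,
# as in combine_latest(): the emitted tuple references every upstream
# datum, so its metadata must carry all of their counters as a list.
def as_counter_list(ref):
    """Normalize metadata that may hold one counter or a list of them."""
    return list(ref) if isinstance(ref, list) else [ref]

def merge_counters(ref_a, ref_b):
    # downstream consumers then increment/decrement every counter in the list
    return as_counter_list(ref_a) + as_counter_list(ref_b)
```

Normalizing to a list at the join means downstream nodes only ever deal with one shape of metadata after a join, at the cost of handling both the bare and list cases at the boundary.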
Previously, and for the purposes here, we have categorized the two types of functions in Streamz as cache nodes and non-cache nodes. I think it would make more sense to categorize them as nodes that call _emit and do not return before it completes (synchronous nodes), and nodes that either do not call _emit or return before calling _emit, which unwinds the call stack (asynchronous nodes). The latter cause the pipeline to become asynchronous and give us a reason to track reference counts. All of the async nodes will handle the reference counters the same way: if data is cached, the counter will be incremented when the data comes into the node, and after the data leaves the node it will be decremented. Sync nodes will only pass the metadata downstream. This leaves the last problem, which is how the async nodes will handle the metadata. This is detailed below.

Synchronous nodes
- Core
- Dask

Async’ nodes (grouped by strategy)

Core

- Input to output is 1:1: Couple the metadata with the data in the cache, then pull the metadata off the cache with the data and emit them together.
- Single-value inputs, collection outputs: To increase efficiency, the metadata for the last element can be emitted while the metadata for the other elements is not. However, this could lead to problems where the metadata for every element is expected downstream: any unit interested in when the data completes will see the elements that are not at the end of the collection as completed. The less efficient alternative is to cache the metadata with the data and emit the metadata as a collection.
- Multiple stream inputs combined into a single output: Every datum that is emitted downstream will have its accompanying metadata emitted with it as a collection.
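The "emit the metadata as a collection" strategy could be sketched as follows; the names are hypothetical and this is not the actual Streamz partition implementation.

```python
# Hypothetical partition-like node: single values in, one collection out.
# Each cached datum keeps its counter retained; on emit, the data and
# their counters are sent downstream together as collections, and the
# counters are released only after the emit.
class Counter:
    def __init__(self):
        self.count = 0

class PartitionNode:
    def __init__(self, n, downstream):
        self.n = n
        self.downstream = downstream  # called as downstream(data, counters)
        self.cache = []

    def update(self, x, counter):
        counter.count += 1                   # datum is now held in the cache
        self.cache.append((x, counter))
        if len(self.cache) == self.n:
            batch, self.cache = self.cache, []
            data = tuple(d for d, _ in batch)
            counters = [c for _, c in batch]
            self.downstream(data, counters)  # emit data and metadata together
            for c in counters:
                c.count -= 1                 # release only after the emit

# Usage
out = []
node = PartitionNode(2, lambda d, m: out.append(d))
c1, c2 = Counter(), Counter()
node.update(1, c1)  # cached; c1.count == 1
node.update(2, c2)  # cache full -> emits (1, 2), releases both counters
```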
to_kafka

I’m not yet sure what to do about this node. The _emit function is called, but the use of the Confluent library for the callback makes it difficult to match the metadata to the datum. There is also the question as to whether we should make any changes here, as to_kafka should be the end of the pipeline.

Dask
A yield is performed before _emit. The metadata is a simple pass-through, just as in sync nodes, but the reference counter will need to be incremented here, as there is a yield before the _emit call.
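A minimal sketch of that ordering, using asyncio in place of Dask futures (all names here are hypothetical):

```python
# The counter is retained before the yield point so it cannot read zero
# while the computation is in flight, and released only after _emit returns.
import asyncio

class Counter:
    def __init__(self):
        self.count = 0

async def dask_like_update(x, counter, submit, _emit):
    counter.count += 1        # retain before the yield point
    result = await submit(x)  # the yield happens before _emit
    _emit(result)
    counter.count -= 1        # release only after _emit returns

# Usage
emitted = []

async def double(x):
    return 2 * x

ctr = Counter()
asyncio.run(dask_like_update(3, ctr, double, emitted.append))
```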