Batch Processing: How to withhold output until input terminates
Hello, thank you very much for designing and open sourcing this system. I’ve been reading y’all’s papers on Trill, FASTER and Quill for years now.
I have a few questions about how to use it.
Trill for batch processing
I realize that Trill is primarily a streaming data engine, but I also work with a lot of batched data that I would like to query using Trill.
Maybe I don’t have the egress side of things set up correctly, but my setup looks like this:
public static async Task EventsPerHour(string rootFolder)
{
    await
        LogExtractor
            .Create()
            .ExtractSingleSiteDirectory(rootFolder)
            .AsObservable()
            .ToTemporalStreamable(
                e => e.Entry.Timestamp.Ticks,
                DisorderPolicy.Drop(TimeSpan.TicksPerHour),
                FlushPolicy.None,
                PeriodicPunctuationPolicy.None())
            .GroupApply(
                e => e.Entry.Timestamp.Hour,
                g => g.Count(),
                (g, v) => new { Hour = g.Key, Count = v })
            .ToStreamEventObservable()
            .ForEachAsync(i => {
                Console.WriteLine(i.Payload);
            });
}
What I would like to get out of this is a single list of (Hour, Count) pairs.
What I actually get is a lot of incremental updates as the data flows in. To compensate, I wrote a handler method that tracks all the updates per group and keeps only the last one. It produces correct output, but it seems wasteful to have the engine keep producing output that I’m discarding.
Can I tell Trill to withhold output until the input stream terminates? If so, how?
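For reference, the keep-only-the-last-update workaround described above might look roughly like this. It is a minimal sketch in plain C# that does not touch Trill itself; the class and member names (LatestPerGroup, OnUpdate, Snapshot) are hypothetical, and it assumes the count arrives as a ulong.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: remembers only the most recent count seen per hour.
public sealed class LatestPerGroup
{
    private readonly Dictionary<int, ulong> latest = new Dictionary<int, ulong>();

    // Call for every incremental update; a later update overwrites an earlier one.
    public void OnUpdate(int hour, ulong count) => latest[hour] = count;

    // Call once the input has completed to get one (Hour, Count) pair per group.
    public IReadOnlyList<(int Hour, ulong Count)> Snapshot()
    {
        var result = new List<(int Hour, ulong Count)>();
        foreach (var pair in latest)
        {
            result.Add((pair.Key, pair.Value));
        }
        result.Sort((a, b) => a.Hour.CompareTo(b.Hour));
        return result;
    }
}
```

OnUpdate would be called from the ForEachAsync callback and Snapshot after the await returns. Every intermediate update still flows through the engine, which is exactly the waste in question.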
Weakened Discoverability
Also, why are so many things marked [EditorBrowsable(EditorBrowsableState.Never)]? For example, I see people using the three-argument version of GroupApply in examples, but for whatever reason you have GroupSelectorInput<T>.Key marked as never browsable, which makes the result selector function seem useless at first.
Is there a reason this property (and others like it) is hidden?
GroupApply vs Partition+Aggregate+SelectByKey
Are these two constructions equivalent? If so, which should I prefer?
    .GroupApply(
        e => e.Entry.Timestamp.Hour,
        g => g.Count(),
        (g, v) => new { Hour = g.Key, Count = v })
    // yields type IStreamable<Empty, 'a>
vs.
    .Partition(e => e.Entry.Timestamp.Hour)
    .Aggregate(g => g.Count())
    .SelectByKey((time, key, count) => new { Hour = key, Count = count })
    // yields type IStreamable<PartitionKey<int>, 'a>
Issue Analytics
- Created 4 years ago
- Comments: 11 (11 by maintainers)
Top GitHub Comments
Ah, the wonderment that is spam filters. I’m glad you’re having fun with Trill - it’s a blast to work on, too.
Re: Partition
The partition method does something kind of special and magical. I’ll try to explain as best I can.
There is a concept within Trill called “partitioned streams”. This feature is one way to get around the restriction within Trill that all data must be in order post-ingress. It allows data to follow an independent timeline per partition. For instance, if you have data coming from a collection of sensors, and you want to do a query per sensor (normally done using GroupApply) but each sensor’s data may arrive at the processing node with different network lag, partitioned streams allow each sensor to have its data treated as its own timeline. Disorder policies (e.g., Drop) are then applied on a per-sensor basis rather than globally.
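To make the per-partition timeline idea concrete, here is a toy simulation in plain C#. It does not use Trill and is not how Trill implements disorder handling. It assumes a Drop-style policy with zero reorder latency, and the names DisorderDemo and DropLate are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

// Toy model of a Drop-style disorder policy (assumption: zero reorder latency).
public static class DisorderDemo
{
    // Keeps an event only if it is not behind the high-water mark of its timeline.
    // perSensor = false: one shared timeline; perSensor = true: one timeline per sensor.
    public static List<(string Sensor, long Time)> DropLate(
        IEnumerable<(string Sensor, long Time)> events, bool perSensor)
    {
        var highWater = new Dictionary<string, long>();
        var kept = new List<(string Sensor, long Time)>();
        foreach (var e in events)
        {
            string timeline = perSensor ? e.Sensor : "";
            if (highWater.TryGetValue(timeline, out long mark) && e.Time < mark)
            {
                continue; // out of order on this timeline, so it is dropped
            }
            highWater[timeline] = e.Time;
            kept.Add(e);
        }
        return kept;
    }
}
```

With the input (a, 10), (b, 5), (a, 11), (b, 6), the shared timeline drops both events from sensor b because they trail sensor a, while the per-sensor timelines keep all four.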
The way that you “enable” this feature is by ingressing PartitionedStreamEvents instead of StreamEvents. Alternatively, one can enable this feature by using ToPartitionedStreamable instead of ToTemporalStreamable. In both of these cases, the result is that you end up with a stream of type IStreamable<PartitionKey<K>, P>. The marker “PartitionKey” in the key type of the streamable means you’re in partitioned world, the world’s strangest theme park.
Now, the method Partition allows the user to introduce partitions in the middle of a query rather than at ingress. This method allows the user to then do temporal operations on the data without worrying about keeping all data in order. For instance, a concrete feature request that we got was to be able to do different windowing on data based on a key. The Partition method allows the user to split the timeline, thus allowing each individual partition to be windowed independently without any fear of misordering. You could then have one partition do a tumbling window on an hour, another partition have a hopping window of 10 minutes with a hop of a minute, and so forth. A good example of the Partition method in action is the Rules Engine example in our samples repo.
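As a rough illustration of what "different windowing based on a key" means, the sketch below counts events per key using a tumbling window whose size depends on the key. It is plain C#, not Trill's Partition API. It covers tumbling windows only, assumes non-negative timestamps, and the names PerKeyWindows, windowSizes and defaultSize are invented for the example.

```csharp
using System;
using System.Collections.Generic;

// Toy model: each key is windowed on its own timeline with its own window size.
public static class PerKeyWindows
{
    // Counts events per (key, window start). Keys missing from windowSizes
    // fall back to defaultSize. Timestamps are assumed to be non-negative.
    public static Dictionary<(string Key, long WindowStart), int> Count(
        IEnumerable<(string Key, long Time)> events,
        IReadOnlyDictionary<string, long> windowSizes,
        long defaultSize)
    {
        var counts = new Dictionary<(string Key, long WindowStart), int>();
        foreach (var (key, time) in events)
        {
            long size = windowSizes.TryGetValue(key, out long s) ? s : defaultSize;
            long windowStart = time - (time % size); // snap down to the window boundary
            counts.TryGetValue((key, windowStart), out int n);
            counts[(key, windowStart)] = n + 1;
        }
        return counts;
    }
}
```

Because each key is bucketed independently, the order in which events for different keys arrive does not affect the result.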
For posterity: if you set your periodic punctuations to the same interval as your window, Trill will emit punctuation events for the missing intervals.
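A minimal sketch of that configuration, assuming the Microsoft.StreamProcessing and System.Reactive packages are referenced. The HourlyCounts class, the Build method, and the choice of a one-hour tumbling window over bare DateTime values are illustrative assumptions, not something taken from this thread.

```csharp
using System;
using System.Reactive.Linq;
using Microsoft.StreamProcessing;

public static class HourlyCounts
{
    // Hypothetical helper: counts events per one-hour tumbling window, with the
    // periodic punctuation interval set to the same length as the window.
    public static IObservable<StreamEvent<ulong>> Build(IObservable<DateTime> timestamps)
    {
        long hour = TimeSpan.TicksPerHour;
        return timestamps
            .ToTemporalStreamable(
                t => t.Ticks,
                DisorderPolicy.Drop(hour),
                FlushPolicy.FlushOnPunctuation,
                PeriodicPunctuationPolicy.Time((ulong)hour)) // same interval as the window
            .TumblingWindowLifetime(hour)
            .Count()
            .ToStreamEventObservable();
    }
}
```

A consumer can tell the two kinds of output apart with StreamEvent's IsData and IsPunctuation properties.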