Trill multicast vs publish
Hello,
I’d like to experiment with the Trill library a little and was wondering what the best approach is for a use case where there is a single ‘source stream’ (for example trading data, such as individual trades) and many queries over that source stream, where the first operator in each query would be a Where that narrows the trades down to an individual instrument. In the writing queries guide I read that Multicast is necessary for such a use case, but when looking at the source code I also found Publish.
I’m not really familiar with Rx, so it is a little confusing to me when to use Multicast vs Publish. What would you suggest? Or is it perhaps better to create a separate stream for each Where instead of a single one, so that Multicast is not necessary? Do you know of any guidelines or lessons learned about that? I’d love to read more, but couldn’t find anything in the docs.
I’ve also seen mentions of partitioning in the source code. Is this only related to the group operator, or is it something that could also be useful for my use case?
I’ll be setting a low batch size (< 5, maybe less). Are there any settings I could tweak for very near real-time queries, sacrificing throughput to get the lowest latency possible?
Thanks a lot!
Issue Analytics
- Created 5 years ago
- Comments:13 (7 by maintainers)
Top GitHub Comments
The “partitioned” versions of the operators that you see in the code have to do with a feature called Partitioned Streams. If you ingress data as PartitionedStreamEvents or specify a partition lambda at ingress, you essentially turn this feature on.
What the feature does is allow Trill to handle multiple timelines - one per key - instead of a single one. Handling disorder, for instance, becomes a per-partition concern. Ordinarily, time in Trill is considered a global construct that is uniform across all data that is seen. With partitions, each individual partition is allowed to progress time individually. The downside is that you cannot then query across partitions; whatever query you specify is applied per-partition.
For example, consider a scenario where you have 10k sensors measuring temperature, and you want to find the maximum temperature per sensor per day. Without partitions, each sensor’s data is measured against a single advancing timeline, and the disorder policies are applied globally. That means that if 100 of those 10k sensors are lagging well behind, then either you will not see results until they have caught up, or the lagging data will be dropped or have its timestamps adjusted.
However, given that the query is returning answers per-sensor, there is really no reason for one sensor’s data being behind or ahead to impact any other sensor’s data. That’s what partitions allow - each sensor will have its own timeline that is not impacted by any other.
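The sensor example can be sketched as a small model in plain Python. This is not Trill code: `ingest_global`, `ingest_partitioned`, and `max_per_sensor` are hypothetical helpers, and the disorder policy is assumed to be "drop anything older than the latest timestamp seen" with no reorder latency.

```python
from collections import defaultdict

# Events are (sensor_id, timestamp, temperature). Sensor "b" lags behind sensor "a".
events = [("a", 10, 20.0), ("a", 11, 21.0), ("b", 3, 18.0), ("a", 12, 22.0), ("b", 4, 19.0)]

def ingest_global(events):
    """One timeline for all sensors: an event older than the global high-water mark is dropped."""
    high_water = float("-inf")
    kept = []
    for sensor, ts, temp in events:
        if ts < high_water:
            continue  # late relative to the single global timeline
        high_water = ts
        kept.append((sensor, ts, temp))
    return kept

def ingest_partitioned(events):
    """One timeline per sensor: lateness is judged only against the same sensor's events."""
    high_water = defaultdict(lambda: float("-inf"))
    kept = []
    for sensor, ts, temp in events:
        if ts < high_water[sensor]:
            continue  # late relative to this sensor's own timeline
        high_water[sensor] = ts
        kept.append((sensor, ts, temp))
    return kept

def max_per_sensor(kept):
    """Maximum temperature per sensor over the events that survived ingress."""
    result = {}
    for sensor, _, temp in kept:
        result[sensor] = max(temp, result.get(sensor, temp))
    return result

global_result = max_per_sensor(ingest_global(events))
partitioned_result = max_per_sensor(ingest_partitioned(events))
```

In this model the global timeline loses every reading from the lagging sensor, while the per-sensor timelines keep them, which mirrors the trade-off described above.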
Multicast would be the best fit when you have a source that you would like to feed to a fixed number (known a priori) of receiver sub-queries. The source is Subscribed to exactly once, Trill ingress, batching, and/or columnarization occur exactly once, and the same data is fed to all the multicast subscribers. The Subscribe to the source happens as soon as the required number of Subscribe operations have been performed on the Multicast endpoint.
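The behaviour described for Multicast can be modelled with a toy class in plain Python. This is only a sketch of the semantics, not Trill's API: the `Multicast` class, its `n` parameter, and the `where` helper are hypothetical names chosen for illustration.

```python
class Multicast:
    """Toy model: subscribes to the source exactly once, after `n` subscribers have attached."""

    def __init__(self, source, n):
        self.source = source            # zero-argument callable returning an iterable of events
        self.n = n                      # fixed number of subscribers, known up front
        self.subscribers = []
        self.source_subscriptions = 0

    def subscribe(self, callback):
        self.subscribers.append(callback)
        if len(self.subscribers) == self.n:
            self._run()

    def _run(self):
        self.source_subscriptions += 1  # ingress happens once for all subscribers
        for event in self.source():
            for callback in self.subscribers:
                callback(event)

def where(predicate, sink):
    """Return a callback that forwards only matching events to `sink`."""
    def callback(event):
        if predicate(event):
            sink.append(event)
    return callback

trades = [("MSFT", 100), ("AAPL", 200), ("MSFT", 101)]
msft, aapl = [], []

mc = Multicast(lambda: trades, n=2)
mc.subscribe(where(lambda t: t[0] == "MSFT", msft))
subscriptions_after_first = mc.source_subscriptions  # nothing flows until the second subscriber
mc.subscribe(where(lambda t: t[0] == "AAPL", aapl))
```

Both filters see the same events, and the source is read only once, which is the property that makes this shape a fit for one trade feed fanned out to a known set of per-instrument queries.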
Publish is the dynamic version: you create a Publish endpoint that anyone can dynamically Subscribe to, even at runtime. The (single) Subscribe to upstream occurs when you call Connect on the endpoint. Any new subscribers after a Connect simply receive the stream from that point forward. Note that because such a subscriber latches on to the stream mid-stream, the user needs to be careful not to use end edges, because then you could have an end edge without a corresponding start edge, which would be a malformed stream. For this reason, I would avoid Publish unless you know what you are doing.
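The end-edge hazard can be made concrete with a toy model in plain Python. This is not Trill's API: the `Publish` class here is a hypothetical stand-in, `pump` is an artificial device for stepping through the stream, and events are simplified to `("start", id)` and `("end", id)` pairs.

```python
class Publish:
    """Toy model: one upstream subscription made by connect(); subscribers may attach at any time."""

    def __init__(self, source):
        self.source = source      # zero-argument callable returning an iterable of events
        self.subscribers = []
        self.upstream = None      # created by connect(), at most once

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def connect(self):
        if self.upstream is None:
            self.upstream = iter(self.source())

    def pump(self, count):
        """Deliver the next `count` upstream events to whoever is subscribed right now."""
        for _ in range(count):
            event = next(self.upstream)
            for callback in self.subscribers:
                callback(event)

def unmatched_ends(events):
    """Return ids of end edges that arrive without a preceding start edge."""
    open_ids, dangling = set(), []
    for kind, event_id in events:
        if kind == "start":
            open_ids.add(event_id)
        elif event_id in open_ids:
            open_ids.remove(event_id)
        else:
            dangling.append(event_id)
    return dangling

edges = [("start", 1), ("start", 2), ("end", 1), ("end", 2)]
early, late = [], []

pub = Publish(lambda: edges)
pub.subscribe(early.append)
pub.connect()
pub.pump(1)                  # only `early` sees ("start", 1)
pub.subscribe(late.append)   # joins mid-stream
pub.pump(3)
```

The late subscriber receives the end edge for interval 1 without ever having seen its start edge, which is the malformed stream the warning above refers to.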
A third option is to use neither and simply call Subscribe on the source separately for each query. This results in multiple Subscribe calls being made to the source, which may be more expensive (as each Subscribe will have its own Trill ingress, batching, etc.), but in this case the source becomes responsible for generating a correct stream to each subscriber independently.
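The cost of this third option can be sketched in plain Python. This is a model, not Trill code: `source` is a hypothetical cold source that replays the stream on every subscription, and `ingress_runs` stands in for the per-subscription ingress and batching work.

```python
ingress_runs = 0

def source():
    """Cold source: every subscription replays the stream and pays the ingress cost again."""
    global ingress_runs
    ingress_runs += 1
    yield from [("MSFT", 100), ("AAPL", 200), ("MSFT", 101)]

# Each query subscribes to the source on its own.
msft = [t for t in source() if t[0] == "MSFT"]
aapl = [t for t in source() if t[0] == "AAPL"]
```

Here two queries cause two passes over the source, so the ingress cost grows with the number of queries, but each query gets a complete, well-formed stream from the beginning.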