partition / window elements by unique keys?
Hello! I'm currently using streamz to group messages streaming through a Kafka topic by both count and time, via the partition() and timed_window() methods, followed by bespoke deduplication of the grouped elements, since many messages published to this topic are simply updates to records in a database. The issue I've run into is that grouping all of those duplicate elements takes up a lot of RAM, and since I only want the first or last element corresponding to a given record, it's also unnecessary. I cobbled together versions of these methods that only output unique values, and thought this use case might apply to more than just me. I've included the code below; if you'd like, I'm happy to submit it in a PR.
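The core idea behind both variants is the same: instead of appending every incoming element to a buffer, key each element into a dict via a user-supplied key function, so at most one element per key is ever held in memory. A minimal pure-Python sketch of that dedup rule (not the streamz API; `dedup` is just an illustrative name) looks like this:

```python
def dedup(elements, key=lambda x: x, keep="first"):
    """Keep at most one element per key, preserving element order."""
    buffer = {}
    for x in elements:
        k = key(x)
        if keep == "last":
            # pop-then-reinsert so dict order reflects the latest occurrence
            buffer.pop(k, None)
            buffer[k] = x
        elif k not in buffer:  # keep == "first": ignore repeats
            buffer[k] = x
    return tuple(buffer.values())

print(dedup([1, 2, 1, 3], keep="first"))  # (1, 2, 3)
print(dedup([1, 2, 1, 3], keep="last"))   # (2, 1, 3)
```

Since Python 3.7 dicts preserve insertion order, which is what makes the pop-then-reinsert trick for `keep="last"` work.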
from typing import Any, Callable, Dict, Hashable, Union
import streamz
from streamz.core import Stream, identity, convert_interval
from tornado import gen
@Stream.register_api()
class partition_unique(Stream):
"""
Partition stream elements into groups of equal size with unique keys only.
Args:
n: Number of (unique) elements to pass through as a group.
key: Callable that accepts a stream element and returns
a unique, hashable representation of the incoming data.
For example, ``key=lambda x: x["a"]`` could be used to allow
only elements with unique ``"a"`` values to pass through.
keep: Which element to keep in the case that a unique key is already
found in the group. If "first", keep element from the first occurrence
of a given key; if "last", keep element from the most recent occurrence.
Note that relative ordering of *elements* is preserved in the data
passed through, and not ordering of *keys*.
        **kwargs: Additional keyword arguments passed to ``Stream``.
Examples:
.. code-block:: pycon
>>> source = Stream()
>>> stream = source.partition_unique(n=3, keep="first").sink(print)
>>> eles = [1, 2, 1, 3, 1, 3, 3, 2]
>>> for ele in eles:
... source.emit(ele)
(1, 2, 3)
(1, 3, 2)
>>> source = Stream()
>>> stream = source.partition_unique(n=3, keep="last").sink(print)
>>> eles = [1, 2, 1, 3, 1, 3, 3, 2]
>>> for ele in eles:
... source.emit(ele)
(2, 1, 3)
(1, 3, 2)
>>> source = Stream()
>>> stream = source.partition_unique(n=3, keep="last").sink(print)
>>> eles = ["f", "fo", "f", "foo", "f", "foo", "foo", "fo"]
>>> for ele in eles:
... source.emit(ele)
('fo', 'f', 'foo')
('f', 'foo', 'fo')
"""
_graphviz_shape = "diamond"
def __init__(
self,
upstream,
n: int,
key: Callable[[Any], Hashable] = identity,
keep: str = "first", # Literal["first", "last"]
**kwargs
):
self.n = n
self.key = key
self.keep = keep
self._buffer = {}
self._metadata_buffer = {}
Stream.__init__(self, upstream, **kwargs)
def update(self, x, who=None, metadata=None):
self._retain_refs(metadata)
y = self.key(x)
if self.keep == "last":
# remove key if already present so that emitted value
# will reflect elements' actual relative ordering
try:
self._buffer.pop(y)
self._metadata_buffer.pop(y)
except KeyError:
pass
self._buffer[y] = x
self._metadata_buffer[y] = metadata
else: # self.keep == "first"
if y not in self._buffer:
self._buffer[y] = x
self._metadata_buffer[y] = metadata
if len(self._buffer) == self.n:
result, self._buffer = tuple(self._buffer.values()), {}
metadata_result, self._metadata_buffer = list(self._metadata_buffer.values()), {}
ret = self._emit(result, metadata_result)
self._release_refs(metadata_result)
return ret
else:
return []
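The count-based batching in ``update()`` above can be rehearsed outside of streamz. Here ``partition_unique_sim`` is a hypothetical helper, not part of the library, that applies the same buffering logic to a plain list and collects the emitted groups:

```python
def partition_unique_sim(elements, n, key=lambda x: x, keep="first"):
    """Pure-Python rehearsal of partition_unique's update() logic."""
    buffer, out = {}, []
    for x in elements:
        k = key(x)
        if keep == "last":
            # re-insert so dict order reflects the latest occurrence
            buffer.pop(k, None)
            buffer[k] = x
        elif k not in buffer:  # keep == "first"
            buffer[k] = x
        if len(buffer) == n:  # n unique keys gathered: emit and reset
            out.append(tuple(buffer.values()))
            buffer = {}
    return out

print(partition_unique_sim([1, 2, 1, 3, 1, 3, 3, 2], n=3, keep="first"))
# [(1, 2, 3), (1, 3, 2)]
```

This reproduces the docstring examples: duplicates never grow the buffer, so a group is only emitted once ``n`` distinct keys have been seen.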
@Stream.register_api()
class timed_window_unique(Stream):
"""
Emit a group of elements with unique keys every interval.
Args:
interval: Number of seconds over which to group elements,
or a ``pandas``-style duration string that can be converted
into seconds.
key: Callable that accepts a stream element and returns
a unique, hashable representation of the incoming data.
For example, ``key=lambda x: x["a"]`` could be used to allow
only elements with unique ``"a"`` values to pass through.
keep: Which element to keep in the case that a unique key is already
found in the group. If "first", keep element from the first occurrence
of a given key; if "last", keep element from the most recent occurrence.
Note that relative ordering of *elements* is preserved in the data
passed through, and not ordering of *keys*.
        **kwargs: Additional keyword arguments passed to ``Stream``.
Examples:
.. code-block:: pycon
>>> import time
>>> source = Stream()
>>> stream = source.timed_window_unique(interval=2, keep="first").sink(print)
>>> eles = [1, 2, 1, 3, 1, 3, 3, 2]
>>> for ele in eles:
... source.emit(ele)
... time.sleep(0.6)
()
(1, 2, 3)
(1, 3)
(2,)
()
>>> source = Stream()
>>> stream = source.timed_window_unique(interval=2, keep="last").sink(print)
>>> eles = [1, 2, 1, 3, 1, 3, 3, 2]
>>> for ele in eles:
... source.emit(ele)
... time.sleep(0.6)
()
(2, 1, 3)
(1, 3)
(2,)
()
>>> source = Stream()
>>> stream = source.timed_window_unique(interval=2, key=lambda x: len(x), keep="last").sink(print)
>>> eles = ["f", "fo", "f", "foo", "f", "foo", "foo", "fo"]
>>> for ele in eles:
... source.emit(ele)
... time.sleep(0.6)
()
('fo', 'f', 'foo')
('f', 'foo')
('fo',)
()
"""
_graphviz_shape = "octagon"
def __init__(
self,
upstream,
interval: Union[int, str],
key: Callable[[Any], Hashable] = identity,
keep: str = "first", # Literal["first", "last"]
**kwargs
):
self.interval = convert_interval(interval)
self.key = key
self.keep = keep
self._buffer = {}
self._metadata_buffer = {}
self.last = gen.moment
Stream.__init__(self, upstream, ensure_io_loop=True, **kwargs)
self.loop.add_callback(self.cb)
def update(self, x, who=None, metadata=None):
self._retain_refs(metadata)
y = self.key(x)
if self.keep == "last":
# remove key if already present so that emitted value
# will reflect elements' actual relative ordering
try:
self._buffer.pop(y)
self._metadata_buffer.pop(y)
except KeyError:
pass
self._buffer[y] = x
self._metadata_buffer[y] = metadata
else: # self.keep == "first"
if y not in self._buffer:
self._buffer[y] = x
self._metadata_buffer[y] = metadata
return self.last
@gen.coroutine
def cb(self):
while True:
result, self._buffer = tuple(self._buffer.values()), {}
metadata_result, self._metadata_buffer = list(self._metadata_buffer.values()), {}
# TODO: figure out why metadata_result is handled differently here...
m = [m for ml in metadata_result for m in ml]
self.last = self._emit(result, m)
self._release_refs(m)
yield self.last
yield gen.sleep(self.interval)
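Because ``timed_window_unique`` flushes its buffer once per interval, its behavior can be approximated deterministically by bucketing elements into per-interval batches and applying the dedup rule within each. ``timed_window_unique_sim`` below is a hypothetical stand-in, not streamz code, where each inner list holds the elements that arrived during one interval:

```python
def timed_window_unique_sim(batches, key=lambda x: x, keep="first"):
    """One deduplicated tuple is emitted per interval (per inner list)."""
    out = []
    for batch in batches:
        buffer = {}  # flushed at every tick, mirroring cb()
        for x in batch:
            k = key(x)
            if keep == "last":
                buffer.pop(k, None)
                buffer[k] = x
            elif k not in buffer:  # keep == "first"
                buffer[k] = x
        out.append(tuple(buffer.values()))
    return out

# Emitting every 0.6 s against a 2 s window roughly buckets the
# docstring's example like this:
print(timed_window_unique_sim([[], [1, 2, 1, 3], [1, 3, 3], [2]]))
# [(), (1, 2, 3), (1, 3), (2,)]
```

Note that, unlike ``partition_unique``, an empty tuple is emitted when no elements arrive during an interval, which matches the leading and trailing ``()`` in the docstring examples.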
Issue Analytics
- Created 3 years ago
- Comments: 9 (9 by maintainers)
Top GitHub Comments

No worries, the terminology has also got me twisted up! 🙂 I think it should be easy to generalize key, but I'm less sure about adding in a timeout. I'll see about submitting a pull request sometime soon; I'm busy with PyData Global for the next few days, but after that!

Closing this out, since the PR addressing it was merged.