
partition / window elements by unique keys?

See original GitHub issue

Hello! I’m currently using streamz to group messages streaming through a kafka topic by both number and time via the partition() and timed_window() methods, followed by bespoke deduplication of the grouped elements, since many messages published to this topic are simply updates to records in a database. The issue I’ve run into is that grouping all of those duplicate elements takes up a lot of RAM, and since I only want the first or last element corresponding to a given record, it’s also unnecessary. I cobbled together versions of these methods that emit only unique values, and thought this use case might apply to more people than just me. I’ve included the code below; if you’d like, I’m happy to submit it in a PR.
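
Before the full implementation, here’s a minimal sketch of the kind of pipeline described above, with the proposed timed_window_unique swapped in for partition()/timed_window() plus manual deduplication. The topic name, consumer settings, and the "record_id" key function are illustrative placeholders only, not part of the actual setup:

import json

from streamz import Stream

# hypothetical kafka source; the topic and consumer settings are placeholders
source = Stream.from_kafka(
    ["db-updates"],
    {"bootstrap.servers": "localhost:9092", "group.id": "dedupe-demo"},
)

# keep only the most recent message per record within each 10-second window,
# rather than buffering every duplicate and deduplicating after the fact
(
    source.map(json.loads)
    .timed_window_unique(interval=10, key=lambda msg: msg["record_id"], keep="last")
    .sink(print)
)

The full implementations of both operators: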

from typing import Any, Callable, Hashable, Union

from streamz.core import Stream, identity, convert_interval
from tornado import gen


@Stream.register_api()
class partition_unique(Stream):
    """
    Partition stream elements into groups of equal size with unique keys only.
    
    Args:
        n: Number of (unique) elements to pass through as a group.
        key: Callable that accepts a stream element and returns
            a unique, hashable representation of the incoming data.
            For example, ``key=lambda x: x["a"]`` could be used to allow
            only elements with unique ``"a"`` values to pass through.
        keep: Which element to keep in the case that a unique key is already
            found in the group. If "first", keep element from the first occurrence
            of a given key; if "last", keep element from the most recent occurrence.
            Note that the relative ordering of *elements* is preserved in the
            emitted data, not the ordering of *keys*.
        **kwargs
    
    Examples:
    
    .. code-block:: pycon
    
        >>> source = Stream()
        >>> stream = source.partition_unique(n=3, keep="first").sink(print)
        >>> eles = [1, 2, 1, 3, 1, 3, 3, 2]
        >>> for ele in eles:
        ...     source.emit(ele)
        (1, 2, 3)
        (1, 3, 2)
        
        >>> source = Stream()
        >>> stream = source.partition_unique(n=3, keep="last").sink(print)
        >>> eles = [1, 2, 1, 3, 1, 3, 3, 2]
        >>> for ele in eles:
        ...     source.emit(ele)
        (2, 1, 3)
        (1, 3, 2)
        
        >>> source = Stream()
        >>> stream = source.partition_unique(n=3, keep="last").sink(print)
        >>> eles = ["f", "fo", "f", "foo", "f", "foo", "foo", "fo"]
        >>> for ele in eles:
        ...     source.emit(ele)
        ('fo', 'f', 'foo')
        ('f', 'foo', 'fo')
    """
    _graphviz_shape = "diamond"

    def __init__(
        self,
        upstream,
        n: int,
        key: Callable[[Any], Hashable] = identity,
        keep: str = "first",  # Literal["first", "last"]
        **kwargs
    ):
        self.n = n
        self.key = key
        self.keep = keep
        self._buffer = {}
        self._metadata_buffer = {}
        Stream.__init__(self, upstream, **kwargs)
    
    def update(self, x, who=None, metadata=None):
        self._retain_refs(metadata)
        y = self.key(x)
        if self.keep == "last":
            # pop the key if it's already present so that re-inserting it
            # moves it to the end of the insertion-ordered dict, preserving
            # the elements' actual relative ordering in the emitted group
            self._buffer.pop(y, None)
            self._metadata_buffer.pop(y, None)
            self._buffer[y] = x
            self._metadata_buffer[y] = metadata
        else:  # self.keep == "first"
            if y not in self._buffer:
                self._buffer[y] = x
                self._metadata_buffer[y] = metadata
        if len(self._buffer) == self.n:
            result, self._buffer = tuple(self._buffer.values()), {}
            metadata_result, self._metadata_buffer = list(self._metadata_buffer.values()), {}
            # each buffered entry is itself a list of metadata objects, so
            # flatten before emitting and releasing references, mirroring
            # streamz's built-in partition
            flat_metadata = [m for ml in metadata_result for m in ml]
            ret = self._emit(result, flat_metadata)
            self._release_refs(flat_metadata)
            return ret
        else:
            return []


@Stream.register_api()
class timed_window_unique(Stream):
    """
    Emit a group of elements with unique keys every interval.
    
    Args:
        interval: Number of seconds over which to group elements,
            or a ``pandas``-style duration string that can be converted
            into seconds.
        key: Callable that accepts a stream element and returns
            a unique, hashable representation of the incoming data.
            For example, ``key=lambda x: x["a"]`` could be used to allow
            only elements with unique ``"a"`` values to pass through.
        keep: Which element to keep in the case that a unique key is already
            found in the group. If "first", keep element from the first occurrence
            of a given key; if "last", keep element from the most recent occurrence.
            Note that the relative ordering of *elements* is preserved in the
            emitted data, not the ordering of *keys*.
        **kwargs
    
    Examples:
    
    .. code-block:: pycon
    
        >>> source = Stream()
        >>> stream = source.timed_window_unique(interval=2, keep="first").sink(print)
        >>> eles = [1, 2, 1, 3, 1, 3, 3, 2]
        >>> for ele in eles:
        ...     source.emit(ele)
        ...     time.sleep(0.6)
        ()
        (1, 2, 3)
        (1, 3)
        (2,)
        ()
        
        >>> source = Stream()
        >>> stream = source.timed_window_unique(interval=2, keep="last").sink(print)
        >>> eles = [1, 2, 1, 3, 1, 3, 3, 2]
        >>> for ele in eles:
        ...     source.emit(ele)
        ...     time.sleep(0.6)
        ()
        (2, 1, 3)
        (1, 3)
        (2,)
        ()
        
        >>> source = Stream()
        >>> stream = source.timed_window_unique(interval=2, key=lambda x: len(x), keep="last").sink(print)
        >>> eles = ["f", "fo", "f", "foo", "f", "foo", "foo", "fo"]
        >>> for ele in eles:
        ...     source.emit(ele)
        ...     time.sleep(0.6)
        ()
        ('fo', 'f', 'foo')
        ('f', 'foo')
        ('fo',)
        ()
    """
    _graphviz_shape = "octagon"

    def __init__(
        self,
        upstream,
        interval: Union[int, str],
        key: Callable[[Any], Hashable] = identity,
        keep: str = "first",  # Literal["first", "last"]
        **kwargs
    ):
        self.interval = convert_interval(interval)
        self.key = key
        self.keep = keep
        self._buffer = {}
        self._metadata_buffer = {}
        self.last = gen.moment
        Stream.__init__(self, upstream, ensure_io_loop=True, **kwargs)
        self.loop.add_callback(self.cb)

    def update(self, x, who=None, metadata=None):
        self._retain_refs(metadata)
        y = self.key(x)
        if self.keep == "last":
            # pop the key if it's already present so that re-inserting it
            # moves it to the end of the insertion-ordered dict, preserving
            # the elements' actual relative ordering in the emitted group
            self._buffer.pop(y, None)
            self._metadata_buffer.pop(y, None)
            self._buffer[y] = x
            self._metadata_buffer[y] = metadata
        else:  # self.keep == "first"
            if y not in self._buffer:
                self._buffer[y] = x
                self._metadata_buffer[y] = metadata
        return self.last
    
    @gen.coroutine
    def cb(self):
        while True:
            result, self._buffer = tuple(self._buffer.values()), {}
            metadata_result, self._metadata_buffer = list(self._metadata_buffer.values()), {}
            # each buffered entry is itself a list of metadata objects, so
            # flatten before emitting and releasing references, mirroring
            # streamz's built-in timed_window
            m = [m for ml in metadata_result for m in ml]
            self.last = self._emit(result, m)
            self._release_refs(m)
            yield self.last
            yield gen.sleep(self.interval)
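
To exercise both operators end to end, a small driver script like this can be run alongside the definitions above (the exact timed_window_unique output depends on wall-clock timing, so treat it as approximate). Note that the keep="last" behavior relies on Python 3.7+ dicts preserving insertion order, which is why update() pops and re-inserts a repeated key:

import time

if __name__ == "__main__":
    source = Stream()
    source.partition_unique(n=3, keep="last").sink(print)
    for ele in [1, 2, 1, 3, 1, 3, 3, 2]:
        source.emit(ele)
    # prints (2, 1, 3), then (1, 3, 2)

    source = Stream()
    source.timed_window_unique(interval=2, keep="first").sink(print)
    for ele in [1, 2, 1, 3, 1, 3, 3, 2]:
        source.emit(ele)
        time.sleep(0.6)
    time.sleep(2.5)  # give the final window a chance to flush before exiting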

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 9 (9 by maintainers)

Top GitHub Comments

1 reaction
bdewilde commented, Nov 11, 2020

No worries, the terminology has also got me twisted up! 🙂 I think it should be easy to generalize key, but I’m less sure about adding in a timeout. I’ll see about submitting a pull request sometime soon — I’m busy with PyData Global for the next few days, but after that!
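
For the timeout piece mentioned above, one possible shape (purely a sketch, not whatever was eventually merged; the partition_unique_timeout name and _flush helper are hypothetical, and scheduling from the emitting thread glosses over tornado's thread-safety rules) would be to schedule a flush on the stream's IOLoop when the first element of a new group arrives, and cancel it whenever a full group is emitted:

@Stream.register_api()
class partition_unique_timeout(partition_unique):
    """Hypothetical variant: also flush a partial group after ``timeout``
    seconds, instead of waiting indefinitely for n unique keys."""

    def __init__(self, upstream, n, timeout=None, **kwargs):
        self.timeout = timeout
        self._flush_handle = None
        # ensure_io_loop so self.loop is available for scheduling flushes
        super().__init__(upstream, n, ensure_io_loop=True, **kwargs)

    def update(self, x, who=None, metadata=None):
        if self.timeout is not None and not self._buffer:
            # first element of a new group: schedule a flush of whatever
            # has accumulated once `timeout` seconds have elapsed
            self._flush_handle = self.loop.call_later(self.timeout, self._flush)
        ret = super().update(x, who=who, metadata=metadata)
        if not self._buffer and self._flush_handle is not None:
            # a full group was just emitted; cancel the pending flush
            self.loop.remove_timeout(self._flush_handle)
            self._flush_handle = None
        return ret

    def _flush(self):
        self._flush_handle = None
        if not self._buffer:
            return
        result, self._buffer = tuple(self._buffer.values()), {}
        metadata_result, self._metadata_buffer = list(self._metadata_buffer.values()), {}
        # flatten buffered metadata lists, as in partition_unique.update
        flat_metadata = [m for ml in metadata_result for m in ml]
        self._emit(result, flat_metadata)
        self._release_refs(flat_metadata)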

0 reactions
bdewilde commented, Dec 8, 2020

Closing this out, since the PR addressing it was merged.
