
partition / window elements by unique keys?

See original GitHub issue

Hello! I’m currently using streamz to group messages streaming through a kafka topic by both number and time via the partition() and timed_window() methods, followed by bespoke deduplication of the grouped elements, since many messages published to this topic are simply updates to records in a database. The issue I’ve run into is that grouping all of those duplicate elements takes up a lot of RAM, and since I only want the first or last element corresponding to a given record, it’s also unnecessary. I cobbled together versions of these methods that emit only unique values, and thought this use case might apply to more people than just me. I’ve included the code below; if you’d like, I’m happy to submit it in a PR.
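
Before the full implementation, here’s a minimal sketch of the kind of pipeline described above, with the proposed timed_window_unique swapped in for partition()/timed_window() plus manual deduplication. The topic name, consumer settings, and the "record_id" key function are illustrative placeholders only, not part of the actual setup:

import json

from streamz import Stream

# hypothetical kafka source; the topic and consumer settings are placeholders
source = Stream.from_kafka(
    ["db-updates"],
    {"bootstrap.servers": "localhost:9092", "group.id": "dedupe-demo"},
)

# keep only the most recent message per record within each 10-second window,
# rather than buffering every duplicate and deduplicating after the fact
(
    source.map(json.loads)
    .timed_window_unique(interval=10, key=lambda msg: msg["record_id"], keep="last")
    .sink(print)
)

The full implementations of both operators: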

from typing import Any, Callable, Hashable, Union

from streamz.core import Stream, identity, convert_interval
from tornado import gen


@Stream.register_api()
class partition_unique(Stream):
    """
    Partition stream elements into groups of equal size with unique keys only.
    
    Args:
        n: Number of (unique) elements to pass through as a group.
        key: Callable that accepts a stream element and returns
            a unique, hashable representation of the incoming data.
            For example, ``key=lambda x: x["a"]`` could be used to allow
            only elements with unique ``"a"`` values to pass through.
        keep: Which element to keep in the case that a unique key is already
            found in the group. If "first", keep element from the first occurrence
            of a given key; if "last", keep element from the most recent occurrence.
            Note that the relative ordering of *elements* is preserved in the
            emitted data, not the ordering of *keys*.
        **kwargs
    
    Examples:
    
    .. code-block:: pycon
    
        >>> source = Stream()
        >>> stream = source.partition_unique(n=3, keep="first").sink(print)
        >>> eles = [1, 2, 1, 3, 1, 3, 3, 2]
        >>> for ele in eles:
        ...     source.emit(ele)
        (1, 2, 3)
        (1, 3, 2)
        
        >>> source = Stream()
        >>> stream = source.partition_unique(n=3, keep="last").sink(print)
        >>> eles = [1, 2, 1, 3, 1, 3, 3, 2]
        >>> for ele in eles:
        ...     source.emit(ele)
        (2, 1, 3)
        (1, 3, 2)
        
        >>> source = Stream()
        >>> stream = source.partition_unique(n=3, keep="last").sink(print)
        >>> eles = ["f", "fo", "f", "foo", "f", "foo", "foo", "fo"]
        >>> for ele in eles:
        ...     source.emit(ele)
        ('fo', 'f', 'foo')
        ('f', 'foo', 'fo')
    """
    _graphviz_shape = "diamond"

    def __init__(
        self,
        upstream,
        n: int,
        key: Callable[[Any], Hashable] = identity,
        keep: str = "first",  # Literal["first", "last"]
        **kwargs
    ):
        self.n = n
        self.key = key
        self.keep = keep
        self._buffer = {}
        self._metadata_buffer = {}
        Stream.__init__(self, upstream, **kwargs)
    
    def update(self, x, who=None, metadata=None):
        self._retain_refs(metadata)
        y = self.key(x)
        if self.keep == "last":
            # pop the key if it's already present so that re-inserting it
            # moves it to the end of the insertion-ordered dict, preserving
            # the elements' actual relative ordering in the emitted group
            self._buffer.pop(y, None)
            self._metadata_buffer.pop(y, None)
            self._buffer[y] = x
            self._metadata_buffer[y] = metadata
        else:  # self.keep == "first"
            if y not in self._buffer:
                self._buffer[y] = x
                self._metadata_buffer[y] = metadata
        if len(self._buffer) == self.n:
            result, self._buffer = tuple(self._buffer.values()), {}
            metadata_result, self._metadata_buffer = list(self._metadata_buffer.values()), {}
            # each buffered entry is itself a list of metadata objects, so
            # flatten before emitting and releasing references, mirroring
            # streamz's built-in partition
            flat_metadata = [m for ml in metadata_result for m in ml]
            ret = self._emit(result, flat_metadata)
            self._release_refs(flat_metadata)
            return ret
        else:
            return []


@Stream.register_api()
class timed_window_unique(Stream):
    """
    Emit a group of elements with unique keys every interval.
    
    Args:
        interval: Number of seconds over which to group elements,
            or a ``pandas``-style duration string that can be converted
            into seconds.
        key: Callable that accepts a stream element and returns
            a unique, hashable representation of the incoming data.
            For example, ``key=lambda x: x["a"]`` could be used to allow
            only elements with unique ``"a"`` values to pass through.
        keep: Which element to keep in the case that a unique key is already
            found in the group. If "first", keep element from the first occurrence
            of a given key; if "last", keep element from the most recent occurrence.
            Note that the relative ordering of *elements* is preserved in the
            emitted data, not the ordering of *keys*.
        **kwargs
    
    Examples:
    
    .. code-block:: pycon
    
        >>> source = Stream()
        >>> stream = source.timed_window_unique(interval=2, keep="first").sink(print)
        >>> eles = [1, 2, 1, 3, 1, 3, 3, 2]
        >>> for ele in eles:
        ...     source.emit(ele)
        ...     time.sleep(0.6)
        ()
        (1, 2, 3)
        (1, 3)
        (2,)
        ()
        
        >>> source = Stream()
        >>> stream = source.timed_window_unique(interval=2, keep="last").sink(print)
        >>> eles = [1, 2, 1, 3, 1, 3, 3, 2]
        >>> for ele in eles:
        ...     source.emit(ele)
        ...     time.sleep(0.6)
        ()
        (2, 1, 3)
        (1, 3)
        (2,)
        ()
        
        >>> source = Stream()
        >>> stream = source.timed_window_unique(interval=2, key=lambda x: len(x), keep="last").sink(print)
        >>> eles = ["f", "fo", "f", "foo", "f", "foo", "foo", "fo"]
        >>> for ele in eles:
        ...     source.emit(ele)
        ...     time.sleep(0.6)
        ()
        ('fo', 'f', 'foo')
        ('f', 'foo')
        ('fo',)
        ()
    """
    _graphviz_shape = "octagon"

    def __init__(
        self,
        upstream,
        interval: Union[int, str],
        key: Callable[[Any], Hashable] = identity,
        keep: str = "first",  # Literal["first", "last"]
        **kwargs
    ):
        self.interval = convert_interval(interval)
        self.key = key
        self.keep = keep
        self._buffer = {}
        self._metadata_buffer = {}
        self.last = gen.moment
        Stream.__init__(self, upstream, ensure_io_loop=True, **kwargs)
        self.loop.add_callback(self.cb)

    def update(self, x, who=None, metadata=None):
        self._retain_refs(metadata)
        y = self.key(x)
        if self.keep == "last":
            # pop the key if it's already present so that re-inserting it
            # moves it to the end of the insertion-ordered dict, preserving
            # the elements' actual relative ordering in the emitted group
            self._buffer.pop(y, None)
            self._metadata_buffer.pop(y, None)
            self._buffer[y] = x
            self._metadata_buffer[y] = metadata
        else:  # self.keep == "first"
            if y not in self._buffer:
                self._buffer[y] = x
                self._metadata_buffer[y] = metadata
        return self.last
    
    @gen.coroutine
    def cb(self):
        while True:
            result, self._buffer = tuple(self._buffer.values()), {}
            metadata_result, self._metadata_buffer = list(self._metadata_buffer.values()), {}
            # each buffered entry is itself a list of metadata objects, so
            # flatten before emitting and releasing references, mirroring
            # streamz's built-in timed_window
            m = [m for ml in metadata_result for m in ml]
            self.last = self._emit(result, m)
            self._release_refs(m)
            yield self.last
            yield gen.sleep(self.interval)
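
To exercise both operators end to end, a small driver script like this can be run alongside the definitions above (the exact timed_window_unique output depends on wall-clock timing, so treat it as approximate). Note that the keep="last" behavior relies on Python 3.7+ dicts preserving insertion order, which is why update() pops and re-inserts a repeated key:

import time

if __name__ == "__main__":
    source = Stream()
    source.partition_unique(n=3, keep="last").sink(print)
    for ele in [1, 2, 1, 3, 1, 3, 3, 2]:
        source.emit(ele)
    # prints (2, 1, 3), then (1, 3, 2)

    source = Stream()
    source.timed_window_unique(interval=2, keep="first").sink(print)
    for ele in [1, 2, 1, 3, 1, 3, 3, 2]:
        source.emit(ele)
        time.sleep(0.6)
    time.sleep(2.5)  # give the final window a chance to flush before exiting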

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 9 (9 by maintainers)

Top GitHub Comments

1 reaction
bdewilde commented, Nov 11, 2020

No worries, the terminology has also got me twisted up! 🙂 I think it should be easy to generalize key, but I’m less sure about adding in a timeout. I’ll see about submitting a pull request sometime soon — I’m busy with PyData Global for the next few days, but after that!
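
For the timeout piece mentioned above, one possible shape (purely a sketch, not whatever was eventually merged; the partition_unique_timeout name and _flush helper are hypothetical, and scheduling from the emitting thread glosses over tornado's thread-safety rules) would be to schedule a flush on the stream's IOLoop when the first element of a new group arrives, and cancel it whenever a full group is emitted:

@Stream.register_api()
class partition_unique_timeout(partition_unique):
    """Hypothetical variant: also flush a partial group after ``timeout``
    seconds, instead of waiting indefinitely for n unique keys."""

    def __init__(self, upstream, n, timeout=None, **kwargs):
        self.timeout = timeout
        self._flush_handle = None
        # ensure_io_loop so self.loop is available for scheduling flushes
        super().__init__(upstream, n, ensure_io_loop=True, **kwargs)

    def update(self, x, who=None, metadata=None):
        if self.timeout is not None and not self._buffer:
            # first element of a new group: schedule a flush of whatever
            # has accumulated once `timeout` seconds have elapsed
            self._flush_handle = self.loop.call_later(self.timeout, self._flush)
        ret = super().update(x, who=who, metadata=metadata)
        if not self._buffer and self._flush_handle is not None:
            # a full group was just emitted; cancel the pending flush
            self.loop.remove_timeout(self._flush_handle)
            self._flush_handle = None
        return ret

    def _flush(self):
        self._flush_handle = None
        if not self._buffer:
            return
        result, self._buffer = tuple(self._buffer.values()), {}
        metadata_result, self._metadata_buffer = list(self._metadata_buffer.values()), {}
        # flatten buffered metadata lists, as in partition_unique.update
        flat_metadata = [m for ml in metadata_result for m in ml]
        self._emit(result, flat_metadata)
        self._release_refs(flat_metadata)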

0 reactions
bdewilde commented, Dec 8, 2020

Closing this out, since the PR addressing it was merged.
