QST/Feature/Bug: On the performance of a rolling window operation when the window is a column of arbitrary integers

See original GitHub issue

Research

I have searched the [pandas] tag on StackOverflow for similar questions.
I have asked my usage related question on StackOverflow.

Link to question on StackOverflow

https://stackoverflow.com/a/71803558/277716

Question about pandas

I linked a complete answer on StackOverflow that tackles the problem of deriving the equivalent of pandas.core.window.rolling.Rolling.max when the window is an arbitrary column of integers in the same dataframe. Even though that solution strives to be vectorized, it is extremely slow compared to the basic case of a constant window size, to the point of becoming unusable for large dataframes. I suspect it may be impossible to make it fast, because SIMD hardware may prefer a constant window size.

However, I wonder whether the pandas developers have ideas on how to do this, since they are the ones who coded the extremely fast (vectorized) pandas.core.window.rolling.Rolling.max.

This would normally be a feature request for pandas.DataFrame.rolling to accept a column of arbitrary integers from the dataframe as the window, but I don’t know whether that can even be made performant.
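To pin down the semantics being asked about, here is a minimal pure-Python sketch of a rolling max whose window size varies per row. The function name `rolling_max_reference` is hypothetical and not part of pandas. It assumes every window is at least 1 and that a window reaching past the start of the data is simply truncated. It is a slow reference to check faster approaches against, not a proposed implementation.

```python
def rolling_max_reference(values, windows):
    """Row i gets the max of the last windows[i] values ending at row i.

    Assumes windows[i] >= 1; windows that reach past the start are truncated.
    """
    out = []
    for i, w in enumerate(windows):
        start = max(0, i + 1 - w)  # clip the window at the first row
        out.append(max(values[start:i + 1]))
    return out


print(rolling_max_reference([1, 3, 2, 5, 4], [1, 2, 3, 1, 2]))  # [1, 3, 3, 5, 5]
```

The per-row Python loop is what makes this kind of approach impractical for large dataframes, which is the performance concern raised above.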

Bug related to later comments below

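import pandas as pd
from pandas import api
import numpy as np

# Custom indexer: each row gets its own window size, taken from `window`.
class MultiWindowIndexer(api.indexers.BaseIndexer):
    def __init__(self, window):
        self.window = np.array(window)
        super().__init__()

    def get_window_bounds(self, num_values, min_periods, center, closed):
        # Each window ends just after its own row (exclusive end).
        end = np.arange(num_values, dtype='int64') + 1
        # Each window starts window[i] rows earlier, clipped to valid positions.
        start = np.clip(end - self.window, 0, num_values)
        return start, end

# Reproducible data: a random walk plus a window size between 1 and 3 per row,
# never larger than the number of rows available so far.
np.random.seed([3,14])
a = np.random.randn(20).cumsum()
w = np.minimum(
    np.random.randint(1, 4, size=a.shape),
    np.arange(len(a))+1
)

df = pd.DataFrame({'Data': a, 'Window': w})

# Rolling max where the window size is read from the Window column.
df['max1'] = df.Data.rolling(MultiWindowIndexer(df.Window)).max(engine='cython')

print(df)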

Expected outcome: at index 18, max1 should be -1.487828 instead of -1.932612.
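One way to locate rows like the one reported is to compare the indexer's output against a brute-force maximum over each row's slice. The sketch below rebuilds the same data and prints any rows where the two disagree. The `expected` column and the `mismatches` name are introduced here for illustration. The `step=None` parameter is an assumption added so the signature also works on pandas versions that pass a `step` argument to `get_window_bounds`. Whether any mismatches appear depends on the installed pandas version.

```python
import numpy as np
import pandas as pd


class MultiWindowIndexer(pd.api.indexers.BaseIndexer):
    def __init__(self, window):
        self.window = np.asarray(window, dtype="int64")
        super().__init__()

    # step=None is an assumption for newer pandas versions; it is ignored here.
    def get_window_bounds(self, num_values=0, min_periods=None,
                          center=None, closed=None, step=None):
        end = np.arange(num_values, dtype="int64") + 1
        start = np.clip(end - self.window, 0, num_values)
        return start, end


np.random.seed([3, 14])
a = np.random.randn(20).cumsum()
w = np.minimum(np.random.randint(1, 4, size=a.shape), np.arange(len(a)) + 1)

df = pd.DataFrame({"Data": a, "Window": w})
df["max1"] = df.Data.rolling(MultiWindowIndexer(df.Window)).max()

# Brute-force reference: max over the last w[i] values ending at row i.
df["expected"] = [a[i + 1 - w[i]: i + 1].max() for i in range(len(a))]

# Rows where the rolling result differs from the reference (may be empty).
mismatches = df[~np.isclose(df["max1"], df["expected"])]
print(mismatches)
```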

source of code and further discussion on the bug at stackoverflow

Issue Analytics

State:
Created a year ago
Comments: 10 (4 by maintainers)

Top GitHub Comments

1 reaction

mroeschke commented, Jul 6, 2022

Seems like the core issue is being discussed in https://github.com/pandas-dev/pandas/issues/46726, so closing in favor of further discussion there.

0 reactions

epigramx commented, Apr 10, 2022

@rhshadrach performance does not appear to be an issue with the method posted in the bug report (assuming the bug is fixed). It appears comparable in performance to a regular rolling operation, whereas a slow apply() operation was practically unusable for large dataframes.

In fact, it is so snappy that I wouldn’t be surprised if it were a default pandas.DataFrame.rolling feature, but I guess that’s not necessary if a custom indexer can do the same. By the way, a method without cython that is only ~40% slower was posted here.
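For illustration only, here is one possible cython-free approach in plain NumPy. It is not necessarily the method linked above. The function name `variable_window_max` is hypothetical. The sketch assumes the data contains no NaN values, that windows smaller than 1 are treated as 1, and that memory proportional to `len(values) * max(windows)` is acceptable.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def variable_window_max(values, windows):
    values = np.asarray(values, dtype=float)
    windows = np.asarray(windows, dtype=np.int64)
    n = len(values)
    if n == 0:
        return values.copy()
    # A window is at least 1 and never reaches back past the first row.
    windows = np.clip(windows, 1, np.arange(n) + 1)
    max_w = int(windows.max())
    # Left-pad with -inf so every row has exactly max_w candidate values.
    padded = np.concatenate([np.full(max_w - 1, -np.inf), values])
    candidates = sliding_window_view(padded, max_w)  # shape (n, max_w)
    # In row i, only the last windows[i] columns belong to that row's window.
    keep = np.arange(max_w) >= (max_w - windows)[:, None]
    return np.where(keep, candidates, -np.inf).max(axis=1)


print(variable_window_max([1, 3, 2, 5, 4], [1, 2, 3, 1, 2]))  # [1. 3. 3. 5. 5.]
```

The trade-off is that every row is evaluated over the largest window size, so this is only reasonable when the maximum window is small relative to the data.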

Top Results From Across the Web

Is there a very fast (vectorized) way to calculate the equivalent ...

I wanted to practically calculate a basic dataframe.column.rolling(window).max() but the window is another column of arbitrary integers ...

pandas.DataFrame.rolling — pandas 1.5.2 documentation

If an integer, the fixed number of observations used for each window. ... Execute the rolling operation per single column or row (...

ENH: allow rolling with non-numerical (eg string) data · Issue ...

Need to find out fist/last on a string column in a rolling window. ... I factorized by creating an index with arbitrary integer...

Python | Pandas dataframe.rolling() - GeeksforGeeks

Pandas dataframe.rolling() function provides the feature of rolling window calculations. The concept of rolling window calculation is most ...

Fast and Robust Sliding Window Vectorization with NumPy

Sliding windows and time series go hand-in-hand but Python's ... any row of a 2D matrix arbitrarily using a 1D matrix of integer...



BUG: Partially incorrect results when using a custom indexer for a rolling window for max and min

BUG: hash_pandas_object ignores column name values