QST/Feature/Bug: On the performance of a rolling window operation when the window is a column of arbitrary integers

See original GitHub issue

Research

  • I have searched the [pandas] tag on StackOverflow for similar questions.

  • I have asked my usage related question on StackOverflow.

Link to question on StackOverflow

https://stackoverflow.com/a/71803558/277716

Question about pandas

I linked a complete answer on Stack Overflow that tackles the problem of computing the equivalent of pandas.core.window.rolling.Rolling.max when the window is an arbitrary column of integers in the same dataframe. Although that solution strives to be vectorized, it is extremely slow compared to the basic case of a constant window size, to the point of being unusable for large dataframes. I suspect it may be impossible to make it fast, because SIMD hardware may prefer a constant window size.

However, I wonder whether the pandas developers have ideas on how to do this, since they are the ones who wrote the extremely fast (vectorized) pandas.core.window.rolling.Rolling.max.

Normally this would be a feature request for pandas.DataFrame.rolling to accept a column of arbitrary integers from the dataframe as the window, but I don’t know whether that can even be done performantly.
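To pin down what is being asked for, here is a minimal reference sketch of a rolling max whose window length comes from a second column. It is a plain Python loop, so it is slow by design and is not the Stack Overflow solution. The name `rolling_max_variable` and the demo data are hypothetical, and the sketch assumes every window length is at least 1.

```python
import numpy as np
import pandas as pd


def rolling_max_variable(values, windows):
    """Reference rolling max: row i looks back windows[i] rows, itself included.

    Assumes every entry of `windows` is >= 1 (hypothetical helper, for illustration).
    """
    values = np.asarray(values, dtype="float64")
    windows = np.asarray(windows, dtype="int64")
    out = np.empty(len(values), dtype="float64")
    for i in range(len(values)):
        # Clamp the start so early rows never reach before the first row.
        start = max(i + 1 - int(windows[i]), 0)
        out[i] = values[start:i + 1].max()
    return out


demo = pd.DataFrame({"Data": [3.0, 1.0, 2.0, 5.0, 4.0], "Window": [1, 2, 3, 1, 2]})
demo["max_ref"] = rolling_max_variable(demo["Data"], demo["Window"])
print(demo)  # max_ref column: 3.0, 3.0, 3.0, 5.0, 5.0
```

A loop like this is useful mainly as a correctness oracle for faster implementations.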

Bug related to later comments below

import pandas as pd
from pandas import api
import numpy as np

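class MultiWindowIndexer(api.indexers.BaseIndexer):
    """Indexer whose window length can differ for every row."""

    def __init__(self, window):
        # One window length per row.
        self.window = np.array(window)
        super().__init__()

    # step=None keeps the signature compatible with pandas >= 1.5, which passes `step`.
    def get_window_bounds(self, num_values, min_periods, center, closed, step=None):
        # Row i aggregates the half-open slice [start[i], end[i]).
        end = np.arange(num_values, dtype='int64') + 1
        start = np.clip(end - self.window, 0, num_values)
        return start, end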

np.random.seed([3,14])
a = np.random.randn(20).cumsum()
w = np.minimum(
    np.random.randint(1, 4, size=a.shape),
    np.arange(len(a))+1
)

df = pd.DataFrame({'Data': a, 'Window': w})

df['max1'] = df.Data.rolling(MultiWindowIndexer(df.Window)).max(engine='cython')

print(df)

Expected outcome: index 18 max1 should be -1.487828 instead of -1.932612
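One way to cross-check the reported discrepancy is to recompute each window's maximum directly from the bounds that the indexer returns and compare it with the rolling result. The sketch below rebuilds the same data as the report; the `max_ref` column and the `mismatch` variable are hypothetical names, and the `step=None` default is an assumption added so the indexer also works on pandas versions that pass `step`. Which rows differ, if any, depends on the installed pandas version.

```python
import numpy as np
import pandas as pd
from pandas.api.indexers import BaseIndexer


class MultiWindowIndexer(BaseIndexer):
    def __init__(self, window):
        self.window = np.array(window, dtype="int64")
        super().__init__()

    def get_window_bounds(self, num_values=0, min_periods=None,
                          center=None, closed=None, step=None):
        end = np.arange(num_values, dtype="int64") + 1
        start = np.clip(end - self.window, 0, num_values)
        return start, end


# Same data construction as in the report.
np.random.seed([3, 14])
a = np.random.randn(20).cumsum()
w = np.minimum(np.random.randint(1, 4, size=a.shape), np.arange(len(a)) + 1)
df = pd.DataFrame({"Data": a, "Window": w})

indexer = MultiWindowIndexer(df.Window)
df["max1"] = df.Data.rolling(indexer).max()

# Brute-force reference: take the max of each slice [start, end) explicitly.
start, end = indexer.get_window_bounds(num_values=len(df))
expected = np.array([a[s:e].max() for s, e in zip(start, end)])
df["max_ref"] = expected

# Row positions where the rolling result disagrees with the reference.
mismatch = np.flatnonzero(~np.isclose(df["max1"].to_numpy(), expected))
print(df)
print("rows that differ from the brute-force reference:", mismatch)
```

If `mismatch` is empty, the installed pandas version agrees with the brute-force reference for this data.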

Source of the code and further discussion of the bug are on Stack Overflow.

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 10 (4 by maintainers)

Top GitHub Comments

1 reaction
mroeschke commented, Jul 6, 2022

Seems like the core issue is being discussed in https://github.com/pandas-dev/pandas/issues/46726, so closing in favor of further discussion there.

0 reactions
epigramx commented, Apr 10, 2022

@rhshadrach performance does not appear to be an issue with the method posted in the bug report (assuming the bug is fixed). It appears comparable in performance to a regular rolling operation, whereas a slow apply() operation was practically unusable for large dataframes.

In fact, it appears so snappy that I wouldn’t be surprised if it were a default pandas.DataFrame.rolling feature, but I guess that’s not necessary if a custom indexer can do the same. By the way, a method without cython that is only ~40% slower was posted here.
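The performance comparison described above can be measured with a rough timing sketch like the following. The series length, the repeat count, and the `time_call` helper are arbitrary, hypothetical choices, and the `step=None` default on the indexer is an assumption for newer pandas versions. Absolute timings vary by machine and pandas version, so no result is claimed here.

```python
import time

import numpy as np
import pandas as pd
from pandas.api.indexers import BaseIndexer


class MultiWindowIndexer(BaseIndexer):
    def __init__(self, window):
        self.window = np.array(window, dtype="int64")
        super().__init__()

    def get_window_bounds(self, num_values=0, min_periods=None,
                          center=None, closed=None, step=None):
        end = np.arange(num_values, dtype="int64") + 1
        start = np.clip(end - self.window, 0, num_values)
        return start, end


def time_call(fn, repeats=3):
    """Return the best wall-clock time of `fn` over a few repeats (hypothetical helper)."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - t0)
    return best


rng = np.random.default_rng(0)
n = 200_000
data = pd.Series(rng.standard_normal(n).cumsum())
windows = np.minimum(rng.integers(1, 4, size=n), np.arange(n) + 1)

# Constant window of 3 rows versus a per-row window of 1 to 3 rows.
t_fixed = time_call(lambda: data.rolling(3).max())
t_custom = time_call(lambda: data.rolling(MultiWindowIndexer(windows)).max())

print(f"fixed window:   {t_fixed:.4f} s")
print(f"custom indexer: {t_custom:.4f} s")
```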


