QST/Feature/Bug: On the performance of a rolling window operation when the window is a column of arbitrary integers
Research
- I have searched the [pandas] tag on StackOverflow for similar questions.
- I have asked my usage-related question on StackOverflow.
Link to question on StackOverflow
https://stackoverflow.com/a/71803558/277716
Question about pandas
I linked a specific, complete answer on StackOverflow that tackles the problem of deriving the equivalent of pandas.core.window.rolling.Rolling.max when the window is an arbitrary column of integers in the same dataframe. Although that solution strives to be vectorized, it is extremely slow compared to the basic case of a constant window size, to the point of being unusable for large dataframes. I suspect it may be impossible to make it fast, because SIMD hardware may favor a constant window size.
However, I wonder whether the pandas developers have ideas on how to do this, since they are the ones who wrote the extremely fast (vectorized) pandas.core.window.rolling.Rolling.max.
Normally this would be a feature request for pandas.DataFrame.rolling to accept arbitrary integers from a column in the dataframe as the window, but I don't know whether that can even be done performantly.
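To pin down what is being asked for, here is a minimal brute-force sketch of the intended semantics: for row i with window size w, the result is the maximum of the last w values up to and including row i. The function name rolling_max_variable_window is hypothetical, and the sketch assumes every window size is at least 1. It is meant as a slow reference for checking faster approaches, not as a proposal for how pandas should implement it.

```python
def rolling_max_variable_window(values, windows):
    """Brute-force reference: out[i] = max(values[i - w + 1 : i + 1]), clipped at row 0.

    Hypothetical helper for illustration; assumes each window size is >= 1.
    """
    out = []
    for i, w in enumerate(windows):
        start = max(0, i - w + 1)  # never reach before the first row
        out.append(max(values[start:i + 1]))
    return out


data = [3.0, 1.0, 2.0, 5.0, 4.0]
window = [1, 2, 3, 1, 2]
result = rolling_max_variable_window(data, window)
print(result)
```

With a DataFrame, the same function could be fed df['Data'].tolist() and df['Window'].tolist(). It runs in O(n * max window) time in a pure Python loop, which is exactly the cost a vectorized solution would need to beat.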
Bug related to later comments below
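import pandas as pd
from pandas import api
import numpy as np


class MultiWindowIndexer(api.indexers.BaseIndexer):
    """Custom indexer whose window length varies per row."""

    def __init__(self, window):
        self.window = np.array(window)
        super().__init__()

    # Note: newer pandas releases (1.5 and later, to my knowledge) expect an
    # extra trailing `step` parameter in this signature.
    def get_window_bounds(self, num_values, min_periods, center, closed):
        # The window for row i covers positions [start[i], end[i]).
        end = np.arange(num_values, dtype='int64') + 1
        start = np.clip(end - self.window, 0, num_values)
        return start, end


np.random.seed([3,14])
a = np.random.randn(20).cumsum()
# Window sizes in 1..3, capped so a window never reaches before the first row.
w = np.minimum(
    np.random.randint(1, 4, size=a.shape),
    np.arange(len(a))+1
)
df = pd.DataFrame({'Data': a, 'Window': w})
df['max1'] = df.Data.rolling(MultiWindowIndexer(df.Window)).max(engine='cython')
print(df)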
Expected outcome: max1 at index 18 should be -1.487828 instead of -1.932612.
The source of the code and further discussion of the bug are on StackOverflow.
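One way to locate rows like index 18 without reading the printed frame by eye is to recompute each window's maximum by brute force and compare it with the reported column. The sketch below mirrors the start/end arithmetic of get_window_bounds using NumPy only. The names window_bounds and find_mismatches are hypothetical, and it assumes every window size is at least 1.

```python
import numpy as np


def window_bounds(windows):
    """Same arithmetic as the custom indexer: row i covers [start[i], end[i])."""
    windows = np.asarray(windows, dtype=np.int64)
    n = len(windows)
    end = np.arange(n, dtype=np.int64) + 1
    start = np.clip(end - windows, 0, n)
    return start, end


def find_mismatches(values, windows, reported):
    """Return (row positions where `reported` differs, brute-force expected maxima)."""
    values = np.asarray(values, dtype=float)
    reported = np.asarray(reported, dtype=float)
    start, end = window_bounds(windows)
    # Slow but obviously correct: slice each window and take its max.
    expected = np.array([values[s:e].max() for s, e in zip(start, end)])
    bad = ~np.isclose(expected, reported)
    return np.flatnonzero(bad).tolist(), expected


# Small hand-made example: the third reported value is deliberately wrong.
values = [3.0, 1.0, 2.0, 5.0, 4.0]
windows = [1, 2, 3, 1, 2]
reported = [3.0, 3.0, 2.0, 5.0, 5.0]
bad_rows, expected = find_mismatches(values, windows, reported)
print(bad_rows, expected)
```

Applied to the frame above as find_mismatches(df.Data, df.Window, df.max1), the first return value should, going by this report, contain 18 on an affected pandas version and be empty once the bug is fixed.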
Issue Analytics
- Created a year ago
- Comments: 10 (4 by maintainers)
Top GitHub Comments
Seems like the core issue is being discussed in https://github.com/pandas-dev/pandas/issues/46726 so closing in favor of further discussion there
@rhshadrach performance does not appear to be an issue with the method posted in the bug report (assuming the bug is fixed). It appears comparable in performance to a regular rolling operation, whereas a slow apply() operation was practically unusable for large dataframes.
In fact, it is so snappy that I wouldn't be surprised if it were a default pandas.DataFrame.rolling feature, but I guess that's not necessary if a custom indexer can do the same. By the way, a method that is only ~40% slower and does not use cython was posted here.
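The linked method is not reproduced here. As a rough illustration of what a NumPy-only approach can look like (a sketch of one possible technique, not the linked code), the function below builds fixed-width sliding windows of the largest window size and masks out the entries that fall outside each row's own window. The name rolling_max_masked is hypothetical. It assumes a non-empty input, window sizes of at least 1, and NumPy 1.20 or later for sliding_window_view.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def rolling_max_masked(values, windows):
    """Variable-window rolling max via fixed-width views plus a per-row mask."""
    values = np.asarray(values, dtype=float)
    windows = np.asarray(windows, dtype=np.int64)
    max_w = int(windows.max())
    # Left-pad with -inf so every row has exactly max_w candidates.
    padded = np.concatenate([np.full(max_w - 1, -np.inf), values])
    views = sliding_window_view(padded, max_w)  # shape (n, max_w); row i ends at values[i]
    # Column j sits max_w - 1 - j positions before the current row.
    distance = np.arange(max_w)[::-1]
    keep = distance[None, :] < windows[:, None]  # keep only the last windows[i] columns
    return np.where(keep, views, -np.inf).max(axis=1)


out = rolling_max_masked([3.0, 1.0, 2.0, 5.0, 4.0], [1, 2, 3, 1, 2])
print(out)
```

This trades memory for speed: the masked array has n * max(window) entries, so it only makes sense when the largest window is small relative to the data. NaN handling is also not matched to what pandas does.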