BUG: Partially incorrect results when using a custom indexer for a rolling window for max and min
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
from pandas import api
import numpy as np

class MultiWindowIndexer(api.indexers.BaseIndexer):
    def __init__(self, window):
        self.window = np.array(window)
        super().__init__()

    def get_window_bounds(self, num_values, min_periods, center, closed):
        end = np.arange(num_values, dtype='int64') + 1
        start = np.clip(end - self.window, 0, num_values)
        return start, end

np.random.seed([3,14])
a = np.random.randn(20).cumsum()
w = np.minimum(
    np.random.randint(1, 4, size=a.shape),
    np.arange(len(a))+1
)
df = pd.DataFrame({'Data': a, 'Window': w})
df['max1'] = df.Data.rolling(MultiWindowIndexer(df.Window)).max(engine='cython')
print(df)
Issue Description
This method tries to use a rolling operation where the window is an arbitrary series of integers instead of a single integer or an offset. It is related to question/feature request #46716 and was originally written as an answer to a StackOverflow question. There, the author of the method notes about the bug: “The cython implementation seems to remember the largest starting index encountered so far and ‘clips’ smaller starting indices to the stored value. More technically correct: only stores the range of the largest start and largest end indices encountered so far in a queue, discarding smaller start indices and making them unavailable.”
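The behavior described in that quote can be modeled with a small sketch. This is a simplified illustration of a monotonic-deque sliding maximum that assumes window starts never decrease; it is not pandas' actual Cython code, and the function names and sample data are hypothetical.

```python
from collections import deque

def deque_rolling_max(values, start, end):
    """Sliding max with a monotonic deque; assumes start/end never decrease."""
    candidates = deque()  # indices whose values are in decreasing order
    consumed = 0          # number of values pushed so far
    out = []
    for s, e in zip(start, end):
        # Push new values, dropping smaller candidates that are now dominated.
        while consumed < e:
            while candidates and values[candidates[-1]] <= values[consumed]:
                candidates.pop()
            candidates.append(consumed)
            consumed += 1
        # Evict candidates before the window start; they are gone for good.
        while candidates and candidates[0] < s:
            candidates.popleft()
        out.append(values[candidates[0]] if candidates else float("nan"))
    return out

def brute_force_max(values, start, end):
    """Recompute each window from scratch (slow but order-independent)."""
    return [max(values[s:e]) if s < e else float("nan")
            for s, e in zip(start, end)]

values = [1.0, 5.0, 2.0, 0.0]   # hypothetical data
windows = [1, 1, 1, 3]          # hypothetical per-row window lengths
end = [i + 1 for i in range(len(values))]
start = [max(e - w, 0) for e, w in zip(end, windows)]  # [0, 1, 2, 1]

print(deque_rolling_max(values, start, end))  # [1.0, 5.0, 2.0, 2.0]
print(brute_force_max(values, start, end))    # [1.0, 5.0, 2.0, 5.0]
```

In the last row the start index moves back from 2 to 1, but the value at index 1 was already evicted from the deque, so the deque version misses the true maximum of 5.0.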
Expected Behavior
The result printed for index 18 should be -1.487828 instead of -1.932612, because at that point the window is 3, so the max is taken over -1.932612, -2.539703, and -1.487828.
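One way to check expected values independently of the rolling engine is a brute-force reference that applies the same bounds as `get_window_bounds`. The sketch below is a minimal version in plain Python; the function name is hypothetical, window lengths are assumed to be at least 1, and the row order of the three quoted values is assumed for illustration only.

```python
def reference_rolling_max(values, windows):
    """Brute-force max over the last windows[i] values ending at row i."""
    n = len(values)
    out = []
    for i in range(n):
        end = i + 1
        # Same clipping as get_window_bounds in the example above.
        start = min(max(end - windows[i], 0), n)
        out.append(max(values[start:end]))
    return out

# The three values quoted above; their row order is an assumption.
tail = [-1.487828, -2.539703, -1.932612]
print(reference_rolling_max(tail, [1, 1, 3])[-1])  # -1.487828
```

Running the same function over `df.Data` and `df.Window` and comparing against `df['max1']` would flag every row where the engine's result differs from the reference.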
Installed Versions
commit : 4bfe3d07b4858144c219b9346329027024102ab6
python : 3.8.10.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.102.1-microsoft-standard-WSL2
Version : #1 SMP Wed Mar 2 00:30:59 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.4.2
numpy : 1.22.3
pytz : 2021.1
dateutil : 2.8.2
pip : 22.0.4
setuptools : 61.1.1
Cython : None
pytest : 7.1.1
hypothesis : None
sphinx : None
blosc : 1.10.6
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.1
IPython : None
pandas_datareader: None
bs4 : 4.10.0
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : 2.0.1
matplotlib : None
numba : None
numexpr : 2.7.3
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.8.0
snappy : None
sqlalchemy : 1.4.34
tables : 3.7.0
tabulate : 0.8.9
xarray : None
xlrd : None
xlwt : None
zstandard : None
Issue Analytics
- State:
- Created a year ago
- Comments: 5 (5 by maintainers)
Top GitHub Comments
I am a little doubtful it can be sharable for other aggregations because IIUC the min/max window algorithm uses value comparisons since it’s just looking for min/max.
I suppose so, but I'm not too thrilled about maintaining heuristics for when to use fast vs slow in addition to maintaining both algorithms.
We’ve had precedent for collapsing two different algorithms before, trading off performance for the sake of correctness & maintainability, so if going back to the more correct algorithm doesn’t incur that much of a performance hit, I think that would be worthwhile.
@mroeschke - I haven’t looked at whether the algorithm in use can be adapted for arbitrary windows; if not, does it make sense to have two different algorithms (fastpath/slowpath)?
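For reference, the fastpath/slowpath heuristic discussed in these comments could look roughly like the sketch below. It only shows the dispatch decision, under the assumption that the fast algorithm is valid whenever both window bounds are non-decreasing; the function names are hypothetical and nothing here reflects code that exists in pandas.

```python
def is_non_decreasing(seq):
    """True if each element is >= the one before it."""
    return all(a <= b for a, b in zip(seq, seq[1:]))

def choose_path(start, end):
    """Pick an algorithm from the window bounds (hypothetical heuristic)."""
    # Assumption: a queue-based fast path is only safe when windows
    # never move backwards; anything else falls back to a full recompute.
    if is_non_decreasing(start) and is_non_decreasing(end):
        return "fast"
    return "slow"

# Fixed window of 2: bounds only move forward.
print(choose_path([0, 0, 1, 2], [1, 2, 3, 4]))  # fast
# Per-row windows [1, 1, 1, 3]: the last start moves back from 2 to 1.
print(choose_path([0, 1, 2, 1], [1, 2, 3, 4]))  # slow
```

The check itself is linear in the number of windows, so its cost is small next to the aggregation; the maintenance concern raised above is about keeping two algorithms, not about the check.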