BUG: Partially incorrect results when using a custom indexer for a rolling window for max and min

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
from pandas import api
import numpy as np

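class MultiWindowIndexer(api.indexers.BaseIndexer):
    """Rolling-window indexer whose window size varies per row."""

    def __init__(self, window):
        # One integer window size per row of the data being rolled over
        self.window = np.array(window)
        super().__init__()

    def get_window_bounds(self, num_values, min_periods, center, closed):
        # Row i covers [start[i], end[i]): it ends just after row i and
        # reaches back self.window[i] rows, clipped at the first row
        end = np.arange(num_values, dtype='int64') + 1
        start = np.clip(end - self.window, 0, num_values)
        return start, end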

np.random.seed([3,14])
a = np.random.randn(20).cumsum()
w = np.minimum(
    np.random.randint(1, 4, size=a.shape),
    np.arange(len(a))+1
)

df = pd.DataFrame({'Data': a, 'Window': w})

df['max1'] = df.Data.rolling(MultiWindowIndexer(df.Window)).max(engine='cython')

print(df)

Issue Description

This method performs a rolling operation in which the window is an arbitrary series of integers rather than a single integer or an offset. It is related to question/feature request #46716 and was originally written as an answer to a StackOverflow question. There, the author of the method notes about the bug: “The cython implementation seems to remember the largest starting index encountered so far and ‘clips’ smaller starting indices to the stored value. More technically correct: only stores the range of the largest start and largest end indices encountered so far in a queue, discarding smaller start indices and making them unavailable.”
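The behavior described in that quote can be reproduced with a simplified sliding-window maximum built on a monotonic deque. The sketch below is an illustration only and is not the pandas cython source; `deque_rolling_max` is a hypothetical name, and the three values and window sizes are an assumed arrangement chosen so that a start index moves backwards.

```python
from collections import deque


def deque_rolling_max(values, starts, ends):
    """Sliding-window max with a monotonic deque (illustrative sketch).

    Only correct when neither the start nor the end indices ever decrease.
    """
    q = deque()  # candidate indices; their values decrease from front to back
    out = []
    prev_end = 0
    for start, end in zip(starts, ends):
        # Admit only the values that are new since the previous window
        for k in range(prev_end, end):
            while q and values[k] >= values[q[-1]]:
                q.pop()
            q.append(k)
        prev_end = max(prev_end, end)
        # Drop candidates left of the current start; they never come back
        while q and q[0] < start:
            q.popleft()
        out.append(values[q[0]] if q else float("nan"))
    return out


# Assumed toy input: the window shrinks to 1 and then grows back to 3
values = [-1.487828, -2.539703, -1.932612]
window = [1, 1, 3]
ends = [i + 1 for i in range(len(values))]
starts = [max(e - w, 0) for e, w in zip(ends, window)]  # [0, 1, 0]

print(deque_rolling_max(values, starts, ends))
```

In this toy case the last window spans all three values, yet the first value was discarded when the start index advanced to 1, so the last result is -1.932612 rather than the true maximum of -1.487828.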

Expected Behavior

The result printed for index 18 should be -1.487828 instead of -1.932612, because at that point the window is 3 and the maximum is taken over -1.932612, -2.539703, and -1.487828.
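One way to check the expected value independently is a brute-force reference that recomputes every window from scratch. The sketch below is not part of pandas: `reference_rolling_max` and `variable_window_bounds` are hypothetical helpers, every window size is assumed to be at least 1, and the toy input reuses the three values quoted above in an assumed order with assumed window sizes.

```python
def variable_window_bounds(windows):
    """Return (start, end) pairs; window i covers values[start:end]."""
    bounds = []
    for i, w in enumerate(windows):
        end = i + 1
        start = max(end - int(w), 0)  # same clipping as get_window_bounds
        bounds.append((start, end))
    return bounds


def reference_rolling_max(values, windows):
    """Brute-force rolling max: each window is evaluated independently."""
    values = list(values)
    return [max(values[s:e]) for s, e in variable_window_bounds(windows)]


# Assumed arrangement of the three values, with the last window equal to 3
data = [-1.487828, -2.539703, -1.932612]
print(reference_rolling_max(data, [1, 1, 3]))
```

Because no state is carried between windows, the result cannot depend on whether the start indices increase or decrease. Passing `df.Data` and `df.Window` from the example should give a column to compare against `max1`.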

Installed Versions

commit : 4bfe3d07b4858144c219b9346329027024102ab6
python : 3.8.10.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.102.1-microsoft-standard-WSL2
Version : #1 SMP Wed Mar 2 00:30:59 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.4.2
numpy : 1.22.3
pytz : 2021.1
dateutil : 2.8.2
pip : 22.0.4
setuptools : 61.1.1
Cython : None
pytest : 7.1.1
hypothesis : None
sphinx : None
blosc : 1.10.6
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.1
IPython : None
pandas_datareader: None
bs4 : 4.10.0
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : 2.0.1
matplotlib : None
numba : None
numexpr : 2.7.3
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.8.0
snappy : None
sqlalchemy : 1.4.34
tables : 3.7.0
tabulate : 0.8.9
xarray : None
xlrd : None
xlwt : None
zstandard : None

Issue Analytics

State:
Created a year ago
Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction

mroeschke commented, Apr 12, 2022

I am a little doubtful it can be sharable for other aggregations because IIUC the min/max window algorithm uses value comparisons since it’s just looking for min/max

“does it make sense to have two different algorithms (fastpath/slowpath)”

I suppose so, but not too thrilled about maintaining heuristics when to use fast vs slow in addition to maintaining both algorithms.

We’ve had precedent for collapsing two different algorithms before trading off performance for the sake of correctness & maintainability, so if going back to the more correct algorithm doesn’t incur that much of a performance hit I think that would be worthwhile

0 reactions

rhshadrach commented, Apr 11, 2022

@mroeschke - I haven’t taken a look if the used algorithm can be adapted for arbitrary windows; if not, does it make sense to have two different algorithms (fastpath/slowpath)?
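As a rough illustration of the kind of heuristic a fastpath/slowpath split would need, the sketch below checks whether window bounds ever move backwards. `bounds_are_monotonic` is a hypothetical helper and not a pandas function; it only shows what such a check could look like, assuming the fast algorithm requires non-decreasing start and end indices.

```python
def bounds_are_monotonic(starts, ends):
    """True when neither the start nor the end indices ever decrease."""
    pairs = list(zip(starts, ends))
    return all(
        s0 <= s1 and e0 <= e1
        for (s0, e0), (s1, e1) in zip(pairs, pairs[1:])
    )


# Fixed window of 2 over four values: bounds only move forward
print(bounds_are_monotonic([0, 0, 1, 2], [1, 2, 3, 4]))

# Variable windows [1, 1, 3]: the last start index jumps back to 0
print(bounds_are_monotonic([0, 1, 0], [1, 2, 3]))
```

The check costs one pass over the bounds, which is part of the maintenance trade-off raised in the comments.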

BUG: Series rolling .var with window>14, center=True, and any win_type crashes python with no error message

QST/Feature/Bug: On the performance of a rolling window operation when the window is a column of arbitrary integers