BUG: Partially incorrect results when using a custom indexer for a rolling window for max and min

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
from pandas import api
import numpy as np

class MultiWindowIndexer(api.indexers.BaseIndexer):
    def __init__(self, window):
        self.window = np.array(window)
        super().__init__()

    def get_window_bounds(self, num_values, min_periods, center, closed):
        end = np.arange(num_values, dtype='int64') + 1
        start = np.clip(end - self.window, 0, num_values)
        return start, end

np.random.seed([3,14])
a = np.random.randn(20).cumsum()
w = np.minimum(
    np.random.randint(1, 4, size=a.shape),
    np.arange(len(a))+1
)

df = pd.DataFrame({'Data': a, 'Window': w})

df['max1'] = df.Data.rolling(MultiWindowIndexer(df.Window)).max(engine='cython')

print(df)

Issue Description

This method essentially tries to perform a rolling operation where the window is an arbitrary series of integers instead of a single integer or an offset. It is related to question/feature request #46716 and was originally written as an answer to a StackOverflow question. There, the author of the method notes on the bug: “The cython implementation seems to remember the largest starting index encountered so far and ‘clips’ smaller starting indices to the stored value. More technically correct: only stores the range of the largest start and largest end indices encountered so far in a queue, discarding smaller start indices and making them unavailable.”
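The failure mode quoted above can be modelled in a few lines of plain Python. This is only a toy sketch and not the pandas Cython code: the function name `deque_window_max` and the sample data are hypothetical, and the sketch assumes every window contains at least one value. It shows how a monotonic-deque sliding maximum, which is correct when start indices never decrease, silently loses values once a later window starts earlier than a previous one.

```python
from collections import deque


def deque_window_max(values, start, end):
    """Sliding max over [start[i], end[i]) using a monotonic deque.

    Toy model (assumption, not pandas source): correct only when
    `start` never decreases, because dropped indices cannot come back.
    """
    out = []
    candidates = deque()  # indices whose values are strictly decreasing
    next_idx = 0          # next position in `values` not yet pushed
    for s, e in zip(start, end):
        # Push new elements up to the window end.
        while next_idx < e:
            while candidates and values[candidates[-1]] <= values[next_idx]:
                candidates.pop()
            candidates.append(next_idx)
            next_idx += 1
        # Drop indices left of the window start; they are gone for good.
        while candidates and candidates[0] < s:
            candidates.popleft()
        out.append(values[candidates[0]])
    return out


# Made-up data: window sizes 1, 1, 3 give start indices 0, 1, 0.
values = [5, 1, 2]
start = [0, 1, 0]
end = [1, 2, 3]
print(deque_window_max(values, start, end))  # [5, 1, 2]; the true max of the last window is 5
```

In this model the value 5 is discarded when the second window starts at index 1, so the third window, which starts at index 0 again, can no longer see it.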

Expected Behavior

The result printed for index 18 should be -1.487828 instead of -1.932612, because at that point the window is 3 and the maximum is taken over -1.932612, -2.539703, and -1.487828.
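One way to check the expected values independently of the rolling engine is a brute-force reference that recomputes every window from scratch. The helper below is a minimal sketch: the name `reference_window_max` is hypothetical, the sample data is made up, and it assumes each window size is at least 1.

```python
def reference_window_max(values, windows):
    """Max of the trailing `windows[i]` values ending at position i (inclusive).

    Each window is recomputed independently, so earlier windows
    cannot affect later ones.
    """
    result = []
    for i, width in enumerate(windows):
        start = max(0, i + 1 - int(width))  # clip at the beginning of the data
        result.append(max(values[start:i + 1]))
    return result


# Made-up data: the last window spans all three values.
print(reference_window_max([5, 1, 2], [1, 1, 3]))  # [5, 1, 5]
```

Applied to the example above, something like `reference_window_max(df.Data.tolist(), df.Window.tolist())` could be compared against the `max1` column to locate the rows that differ.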

Installed Versions

commit : 4bfe3d07b4858144c219b9346329027024102ab6
python : 3.8.10.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.102.1-microsoft-standard-WSL2
Version : #1 SMP Wed Mar 2 00:30:59 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.4.2
numpy : 1.22.3
pytz : 2021.1
dateutil : 2.8.2
pip : 22.0.4
setuptools : 61.1.1
Cython : None
pytest : 7.1.1
hypothesis : None
sphinx : None
blosc : 1.10.6
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.1
IPython : None
pandas_datareader: None
bs4 : 4.10.0
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : 2.0.1
matplotlib : None
numba : None
numexpr : 2.7.3
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.8.0
snappy : None
sqlalchemy : 1.4.34
tables : 3.7.0
tabulate : 0.8.9
xarray : None
xlrd : None
xlwt : None
zstandard : None

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
mroeschke commented, Apr 12, 2022

I am a little doubtful it can be shareable for other aggregations because IIUC the min/max window algorithm uses value comparisons since it’s just looking for min/max.

“does it make sense to have two different algorithms (fastpath/slowpath)”

I suppose so, but not too thrilled about maintaining heuristics when to use fast vs slow in addition to maintaining both algorithms.

We’ve had precedent for collapsing two different algorithms before, trading off performance for the sake of correctness & maintainability, so if going back to the more correct algorithm doesn’t incur that much of a performance hit, I think that would be worthwhile.
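To make the fastpath/slowpath trade-off concrete, here is a hedged sketch of what such a heuristic could look like. Nothing here is taken from pandas: `bounds_are_monotonic`, `choose_path`, and `slow_max` are hypothetical names, and the rule "use the fast path only when window bounds never decrease" is an assumption drawn from the issue description.

```python
def bounds_are_monotonic(start, end):
    """True when neither the start nor the end indices ever decrease."""
    def non_decreasing(seq):
        return all(a <= b for a, b in zip(seq, seq[1:]))
    return non_decreasing(start) and non_decreasing(end)


def slow_max(values, start, end):
    """Slow path: recompute every window [start[i], end[i]) from scratch."""
    return [max(values[s:e]) for s, e in zip(start, end)]


def choose_path(start, end):
    """Hypothetical heuristic: queue-based fast path only for monotonic bounds."""
    return "fast" if bounds_are_monotonic(start, end) else "slow"


print(choose_path([0, 0, 1], [1, 2, 3]))  # fast
print(choose_path([0, 1, 0], [1, 2, 3]))  # slow
```

The check itself is a single linear pass over the bounds, but it is exactly the kind of extra heuristic that would have to be maintained alongside both algorithms.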

0 reactions
rhshadrach commented, Apr 11, 2022

@mroeschke - I haven’t looked at whether the algorithm in use can be adapted for arbitrary windows; if not, does it make sense to have two different algorithms (fastpath/slowpath)?
