BUG: rolling window functions don't support custom indexers

Code Sample, a copy-pastable example if possible

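IN:
import numpy as np
import pandas as pd
from pandas.api.indexers import BaseIndexer

class ForwardIndexer(BaseIndexer):

    def get_window_bounds(self, num_values, min_periods, center, closed):
        # Window i covers rows [start[i], end[i]) and always starts at row i.
        start = np.empty(num_values, dtype=np.int64)
        end = np.empty(num_values, dtype=np.int64)
        for i in range(num_values):
            if i + min_periods <= num_values:
                # Look forward window_size rows, clipped at the end of the data.
                start[i] = i
                end[i] = min(i + self.window_size, num_values)
            else:
                # Fewer than min_periods rows remain: use a single-row window.
                start[i] = i
                end[i] = i + 1
        return start, end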

x = pd.DataFrame({"a": [1,2,3,4,5,6,7,8,9]})

rolling = x["a"].rolling(ForwardIndexer(window_size=3), min_periods=2)

result = rolling.min()
result

OUT:
0    0.0
1    0.0
2    1.0
3    2.0
4    3.0
5    4.0
6    5.0
7    7.0
8    NaN
Name: a, dtype: float64

IN:
expected = rolling.apply(lambda x: min(x))
expected

OUT:
0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    6.0
6    7.0
7    8.0
8    NaN
Name: a, dtype: float64

Problem description

We state here that we support supplying a custom Indexer when building a pandas.DataFrame.rolling object. While the object does get built, and it returns the correct windows, it doesn’t support many rolling window functions. The problem is that our implementations of these aggregation functions expect a standard backward-looking window and we support centered windows via a bit of a crutch.

For example, rolling.min eventually falls through to _roll_min_max_variable in aggregations.pyx, which uses this bit of code to record the output:

        for i in range(endi[0], endi[N-1]):
            if not Q.empty() and curr_win_size > 0:
                output[i-1+close_offset] = calc_mm(
                    minp, nobs, values[Q.front()])
            else:
                output[i-1+close_offset] = NaN

This indexing of output means that the window minimum gets written near the end of the window, even if the window is forward-looking. I’ve investigated a bit, and there is a similar issue in rolling.std - it also isn’t adapted to more flexible rolling windows.
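To make the indexing problem concrete, here is a minimal plain-Python sketch. It is not the pandas Cython code: `forward_bounds` follows the `get_window_bounds` logic from the report above, and `window_min` is a hypothetical reference helper that stores each window's minimum at the index of the window it belongs to, which is roughly what `rolling.apply` does.

```python
import math

def forward_bounds(num_values, window_size, min_periods):
    """Forward-looking bounds, following the ForwardIndexer in the report."""
    start, end = [], []
    for i in range(num_values):
        start.append(i)
        if i + min_periods <= num_values:
            end.append(min(i + window_size, num_values))
        else:
            end.append(i + 1)
    return start, end

def window_min(values, start, end, min_periods):
    """Hypothetical reference: output[i] depends only on values[start[i]:end[i]]."""
    output = []
    for s, e in zip(start, end):
        window = values[s:e]
        # Too few observations in the window yields NaN, as with min_periods.
        output.append(float(min(window)) if len(window) >= min_periods else math.nan)
    return output

values = [1, 2, 3, 4, 5, 6, 7, 8, 9]
start, end = forward_bounds(len(values), window_size=3, min_periods=2)
result = window_min(values, start, end, min_periods=2)
```

Because the result for window `i` is written at position `i` rather than near `end[i]`, the sketch reproduces the expected output (1.0 through 8.0, then NaN), at the cost of rescanning every window.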

While it’s not possible to make rolling window aggregation functions completely universal without loss of efficiency, it’s possible to adapt them to most useful cases: forward-looking, smoothly contracting and expanding. We’d still have to think on how we would check that we support a custom Indexer, and whether we would check at all. It might be possible to just specify the supported kinds in the docs and throw a warning or do something similar.
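As a rough illustration of what such a check could look like, the sketch below tests whether window bounds move only forward, which covers the forward-looking, contracting and expanding cases. Both function names are hypothetical and nothing here is existing pandas API; it is only one possible shape for the idea.

```python
import warnings

def bounds_are_monotonic(start, end):
    """Hypothetical check: both bounds never move backward and start <= end."""
    pairs_ok = all(s <= e for s, e in zip(start, end))
    start_ok = all(a <= b for a, b in zip(start, start[1:]))
    end_ok = all(a <= b for a, b in zip(end, end[1:]))
    return pairs_ok and start_ok and end_ok

def warn_if_unsupported(start, end):
    """Hypothetical guard: warn instead of silently returning wrong results."""
    if not bounds_are_monotonic(start, end):
        warnings.warn(
            "window bounds are not monotonic; the fast aggregation may be wrong",
            UserWarning,
        )
        return False
    return True

# Forward-looking windows of size 3 over 5 rows pass the check.
forward_ok = warn_if_unsupported([0, 1, 2, 3, 4], [3, 4, 5, 5, 5])
```

A check like this costs one pass over the bounds, so it would be cheap compared with the aggregation itself.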

If we choose this path, I’d be happy to deal with the problem over a series of PRs or share the load with someone. Looks like a fair bit of work, but the pandemic freed up a lot of time.

Expected Output

OUT:
0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    6.0
6    7.0
7    8.0
8    NaN
Name: a, dtype: float64

Output of pd.show_versions()

INSTALLED VERSIONS

commit : d308712c8edef078524b8a65df7cb74e9019218e
python : 3.7.6.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.18362
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : ru_RU.UTF-8
LOCALE : None.None

pandas : 0.26.0.dev0+2635.gd308712c8
numpy : 1.17.5
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 44.0.0.post20200106
Cython : 0.29.14
pytest : 5.3.4
hypothesis : 5.2.0
sphinx : 2.3.1
blosc : None
feather : None
xlsxwriter : 1.2.7
lxml.etree : 4.4.2
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader : None
bs4 : 4.8.2
bottleneck : 1.3.1
fastparquet : None
gcsfs : None
matplotlib : 3.1.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.1
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : 0.4.0
scipy : 1.3.1
sqlalchemy : 1.3.12
tables : 3.6.1
tabulate : 0.8.6
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.47.0

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Reactions: 2
  • Comments: 15 (14 by maintainers)

Top GitHub Comments

2 reactions
WhistleWhileYouWork commented, Mar 29, 2020

Hi, I came here to report this issue as well. I wanted to chime in on a couple things.

First thanks for pandas and this very useful new feature!

I think that at the very least there should be support for window functions on forward-looking windows. Forward-looking windows are an oft-requested feature (here in pandas issues) and rolling Indexer support was a huge step in the right direction.

Without that support, the use of apply is required (as stated above), but the performance hit is orders of magnitude too large. My df using apply(max) takes 6:46 minutes while builtin max() takes a mere 5 seconds. To work around the performance hit, I have to use apply with numba to get it to just over 7 seconds.

Also, I don’t believe agg can be used with numba since numba requires the numpy backing array. So, we also gain the simplicity of agg if window functions are supported.

For now, might I suggest adding a mention and an example to the guide docs at Custom window rolling, showing apply as a workaround and, optionally, numba for performance?

Here is a complete numba example for max:

import numba
import numpy as np
import pandas as pd
from pandas.api.indexers import BaseIndexer

@numba.jit
def numba_max(x):
    return max(x)

class ForwardIndexer(BaseIndexer):
    ''' Custom `Indexer` for use in forward-looking `rolling` windows '''
    def get_window_bounds(self, num_values, min_periods, center, closed):
        ''' Set up forward looking windows '''
        start = np.arange(num_values, dtype=np.int64)
        end = start.copy() + self.window_size
        #---- Clip to `num_values`
        end[end > num_values] = num_values
        return start, end

df = pd.DataFrame({"a": [1,2,3,4,5,6,7,8,9]})
df_max = df.rolling(window=ForwardIndexer(window_size=3), min_periods=1).apply(numba_max, raw=True)
df_max

Thanks again.

1 reaction
AlexKirko commented, Mar 30, 2020

@mroeschke, thank you for implementing the error-raising behavior! I’ll work on the functions one by one, starting tomorrow (one PR per function), and we’ll see how it goes.

@WhistleWhileYouWork, thanks for taking an interest in this! Another possible workaround to get efficient forward-looking computation right now is to pad the Series appropriately, shift, use normal backward-looking windows, then shift the results back. Hopefully, fixing the problem proves to be within my capabilities, and the question of workaround efficiency becomes moot.
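The pad-and-shift workaround can be sketched in plain Python, with lists standing in for a Series. The helper names are hypothetical, and `backward_min` is only a stand-in for a regular backward-looking rolling minimum that skips NaN values when counting observations.

```python
import math

def backward_min(values, window_size, min_periods):
    """Backward-looking rolling min over values[j - window_size + 1 : j + 1]."""
    out = []
    for j in range(len(values)):
        lo = max(0, j - window_size + 1)
        # NaN padding does not count toward min_periods.
        window = [v for v in values[lo:j + 1] if not math.isnan(v)]
        out.append(min(window) if len(window) >= min_periods else math.nan)
    return out

def forward_min_via_shift(values, window_size, min_periods):
    """Pad the tail with NaN, roll backward, then shift the results back."""
    pad = window_size - 1
    padded = [float(v) for v in values] + [math.nan] * pad
    backward = backward_min(padded, window_size, min_periods)
    # The backward window ending at i + pad is the forward window starting at i.
    return backward[pad:]

forward = forward_min_via_shift([1, 2, 3, 4, 5, 6, 7, 8, 9], window_size=3, min_periods=2)
```

In pandas terms this would presumably mean appending `window_size - 1` NaN rows, calling the ordinary `rolling(...).min()`, applying `shift(-(window_size - 1))`, and trimming back to the original length; that mapping is an assumption about how the comment was meant.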
