BUG: rolling window functions don't support custom indexers
Code Sample, a copy-pastable example if possible
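import numpy as np
import pandas as pd
from pandas.api.indexers import BaseIndexer


class ForwardIndexer(BaseIndexer):
    # Forward-looking window: up to window_size rows starting at row i.
    def get_window_bounds(self, num_values, min_periods, center, closed):
        start = np.empty(num_values, dtype=np.int64)
        end = np.empty(num_values, dtype=np.int64)
        for i in range(num_values):
            if i + min_periods <= num_values:
                # Enough rows remain: clip the window at the end of the data.
                start[i] = i
                end[i] = min(i + self.window_size, num_values)
            else:
                # Too few rows remain: fall back to a one-row window.
                start[i] = i
                end[i] = i + 1
        return start, end


x = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6, 7, 8, 9]})
rolling = x["a"].rolling(ForwardIndexer(window_size=3), min_periods=2)
result = rolling.min()
result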
OUT:
0 0.0
1 0.0
2 1.0
3 2.0
4 3.0
5 4.0
6 5.0
7 7.0
8 NaN
Name: a, dtype: float64
IN:
expected = rolling.apply(lambda x: min(x))
expected
OUT:
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
6 7.0
7 8.0
8 NaN
Name: a, dtype: float64
Problem description
We state in the documentation that we support supplying a custom Indexer when building a pandas.DataFrame.rolling object. While the object does get built, and it returns the correct windows, it doesn’t support many rolling window functions. The problem is that our implementations of these aggregation functions expect a standard backward-looking window, and we support centered windows via a bit of a crutch.
For example, rolling.min eventually falls through to _roll_min_max_variable in aggregations.pyx, which uses this bit of code to record the output:
for i in range(endi[0], endi[N-1]):
    if not Q.empty() and curr_win_size > 0:
        output[i-1+close_offset] = calc_mm(
            minp, nobs, values[Q.front()])
    else:
        output[i-1+close_offset] = NaN
This indexing of output means that the window minimum gets written near the end of the window, even if the window is forward-looking. I’ve investigated a bit, and there is a similar issue in rolling.std; it also isn’t adapted to more flexible rolling windows.
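To make the indexing point concrete, here is a minimal pure-Python sketch (not the Cython code in aggregations.pyx) of a window minimum that writes each result at the row the window belongs to, rather than at the window's end. The name window_min is hypothetical, and the sketch assumes that both bound arrays are non-decreasing and that the values contain no NaN.

```python
from collections import deque


def window_min(values, start, end, min_periods):
    """Minimum of values[start[i]:end[i]], written to output[i].

    Assumption: start and end are both non-decreasing, and values has no NaN.
    """
    n = len(values)
    output = [float("nan")] * n
    q = deque()  # candidate indices; their values increase from front to back
    added = 0    # next index that has not yet entered the deque
    for i in range(n):
        # Values entering the window on the right.
        while added < end[i]:
            while q and values[q[-1]] >= values[added]:
                q.pop()
            q.append(added)
            added += 1
        # Indices that have fallen out of the window on the left.
        while q and q[0] < start[i]:
            q.popleft()
        nobs = end[i] - start[i]
        if q and nobs >= min_periods:
            output[i] = float(values[q[0]])
    return output


# Bounds that the ForwardIndexer above produces for 9 rows,
# window_size=3 and min_periods=2.
start = [0, 1, 2, 3, 4, 5, 6, 7, 8]
end = [3, 4, 5, 6, 7, 8, 9, 9, 9]
result = window_min([1, 2, 3, 4, 5, 6, 7, 8, 9], start, end, min_periods=2)
```

With these bounds the sketch yields 1.0 through 8.0 followed by NaN, which matches the rolling.apply result shown earlier. Because the output position is tied to i and not to end[i], the same loop handles backward-looking and forward-looking windows.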
While it’s not possible to make rolling window aggregation functions completely universal without loss of efficiency, it’s possible to adapt them to the most useful cases: forward-looking, smoothly contracting and expanding. We’d still have to think about how we would check that we support a custom Indexer, and whether we would check at all. It might be possible to just specify the supported kinds in the docs and throw a warning or do something similar.
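One possible shape for such a check, purely as a sketch: treat window bounds as supported when both arrays are non-decreasing, and warn otherwise. Both helper names below are hypothetical and not part of pandas, and equating "supported" with monotonic bounds is an assumption rather than a settled design.

```python
import warnings


def bounds_are_monotonic(start, end):
    """True if start[i] <= end[i] and both arrays never decrease."""
    pairs_ok = all(s <= e for s, e in zip(start, end))
    start_ok = all(a <= b for a, b in zip(start, start[1:]))
    end_ok = all(a <= b for a, b in zip(end, end[1:]))
    return pairs_ok and start_ok and end_ok


def check_indexer_support(start, end, func_name):
    """Warn instead of silently miscomputing for unsupported window shapes."""
    if bounds_are_monotonic(start, end):
        return True
    warnings.warn(
        f"{func_name} may give incorrect results for these window bounds; "
        "consider rolling.apply instead.",
        UserWarning,
        stacklevel=2,
    )
    return False
```

A check like this costs one pass over the bounds, which is small next to the aggregation itself, but whether to run it on every call is exactly the open question above.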
If we choose this path, I’d be happy to deal with the problem over a series of PRs or share the load with someone. Looks like a fair bit of work, but the pandemic freed up a lot of time.
Expected Output
OUT:
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
6 7.0
7 8.0
8 NaN
Name: a, dtype: float64
Output of pd.show_versions()
INSTALLED VERSIONS
commit : d308712c8edef078524b8a65df7cb74e9019218e
python : 3.7.6.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.18362
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : ru_RU.UTF-8
LOCALE : None.None

pandas : 0.26.0.dev0+2635.gd308712c8
numpy : 1.17.5
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 44.0.0.post20200106
Cython : 0.29.14
pytest : 5.3.4
hypothesis : 5.2.0
sphinx : 2.3.1
blosc : None
feather : None
xlsxwriter : 1.2.7
lxml.etree : 4.4.2
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader : None
bs4 : 4.8.2
bottleneck : 1.3.1
fastparquet : None
gcsfs : None
matplotlib : 3.1.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.1
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : 0.4.0
scipy : 1.3.1
sqlalchemy : 1.3.12
tables : 3.6.1
tabulate : 0.8.6
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.47.0
Issue Analytics
- Created 4 years ago
- Reactions: 2
- Comments: 15 (14 by maintainers)
Top GitHub Comments
Hi, I came here to report this issue as well. I wanted to chime in on a couple of things.

First, thanks for pandas and this very useful new feature!

I think that at the very least there should be support of window functions for forward-looking windows. Forward-looking windows are an oft-requested feature (here in pandas issues), and rolling Indexer support was a huge step in the right direction.

Without that support, the use of apply is required (as stated above), but the performance hit is orders of magnitude too large. My df using apply(max) takes 6:46 minutes while builtin max() takes a mere 5 seconds. To work around the performance hit, I have to use apply with numba to get it to just over 7 seconds.

Also, I don’t believe agg can be used with numba, since numba requires the numpy backing array. So, we also gain the simplicity of agg if window functions are supported.

For now, might I suggest a mention and example in the guide docs at Custom window rolling to use apply as a workaround and optionally numba for performance.

Here is a complete numba example for max:

Thanks again.
@mroeschke, thank you for implementing the error-raising behavior! I’ll work on the functions one by one, starting tomorrow (one PR per function), and we’ll see how it goes.

@WhistleWhileYouWork, thanks for taking an interest in this! Another possible workaround to get efficient forward-looking computation right now is to pad the Series appropriately, shift, use normal backward-looking windows, then shift the results back. Hopefully, fixing the problem proves to be within my capabilities, and the question of workaround efficiency becomes moot.
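A rough sketch of one way to do that pad-and-shift workaround for min, assuming a fixed forward window size. The function name forward_rolling_min is hypothetical. It pads the tail with NaN, runs the ordinary backward-looking rolling min, and shifts the result back so that row i holds the minimum of rows i through i + window - 1.

```python
import numpy as np
import pandas as pd


def forward_rolling_min(s, window, min_periods):
    """Forward-looking rolling min built from a backward-looking window."""
    n = len(s)
    # Pad the tail with NaN so the last few windows can still be formed.
    padded = pd.Series(
        np.concatenate([s.to_numpy(dtype=float), np.full(window - 1, np.nan)])
    )
    # Ordinary backward-looking window over the padded data.
    backward = padded.rolling(window, min_periods=min_periods).min()
    # Position i + window - 1 holds the min of rows i .. i + window - 1,
    # so shift the results back by window - 1 and drop the padding.
    forward = backward.shift(-(window - 1)).iloc[:n]
    return pd.Series(forward.to_numpy(), index=s.index, name=s.name)


s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9], name="a")
out = forward_rolling_min(s, window=3, min_periods=2)
```

For the Series from the report this gives 1.0 through 8.0 followed by NaN, the same as the expected output. NaN padding counts as missing observations, so min_periods behaves the same way at the tail as it does with the ForwardIndexer in the report.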