
DataFrame.rolling does nothing when values are in a list


This issue is based on data from this Stack Overflow post.

First, I collect all the values for each day into a list using resample. I then try to apply a five-day rolling function, but the original DataFrame is returned unchanged. No calculation happens.

>>> import pandas as pd
>>> d = {'favorable': [0.48, 0.51, 0.56, 0.51, 0.48, 0.46, 0.48, 0.49, 0.53, 0.51, 0.49, 0.47, 0.49, 0.53, 0.47, 0.49, 0.52, 0.5, 0.51, 0.51],
         'unfavorable': [0.49, 0.48, 0.4, 0.47, 0.49, 0.46, 0.49, 0.48, 0.45, 0.45, 0.49, 0.47, 0.45, 0.39, 0.44, 0.48, 0.46, 0.47, 0.46, 0.41], 
         'other': [0.03, 0.02, 0.04, 0.02, 0.04, 0.09, 0.03, 0.03, 0.02, 0.04, 0.03, 0.05, 0.06, 0.0, 0.08, 0.03, 0.01, 0.03, 0.02, 0.0]}
>>> index = pd.DatetimeIndex(['2012-10-25', '2012-10-25', '2012-10-26', '2012-10-27', '2012-10-28', '2012-10-28', '2012-10-28', '2012-10-28', '2012-10-30', '2012-11-01', '2012-11-01', '2012-11-01', '2012-11-03', '2012-11-04', '2012-11-04', '2012-11-04', '2012-11-04', '2012-11-04', '2012-11-05', '2012-11-07'])
>>> df = pd.DataFrame(d, index=index)
>>> df

            favorable  other  unfavorable
2012-10-25       0.48   0.03         0.49
2012-10-25       0.51   0.02         0.48
2012-10-26       0.56   0.04         0.40
2012-10-27       0.51   0.02         0.47
2012-10-28       0.48   0.04         0.49
2012-10-28       0.46   0.09         0.46
2012-10-28       0.48   0.03         0.49
2012-10-28       0.49   0.03         0.48
2012-10-30       0.53   0.02         0.45
2012-11-01       0.51   0.04         0.45
2012-11-01       0.49   0.03         0.49
2012-11-01       0.47   0.05         0.47
2012-11-03       0.49   0.06         0.45
2012-11-04       0.53   0.00         0.39
2012-11-04       0.47   0.08         0.44
2012-11-04       0.49   0.03         0.48
2012-11-04       0.52   0.01         0.46
2012-11-04       0.50   0.03         0.47
2012-11-05       0.51   0.02         0.46
2012-11-07       0.51   0.00         0.41

>>> df1 = df.resample('D').apply(lambda x: x.tolist())
>>> df1

                                favorable                          other                     unfavorable
2012-10-25                   [0.48, 0.51]                   [0.03, 0.02]                    [0.49, 0.48]
2012-10-26                         [0.56]                         [0.04]                           [0.4]
2012-10-27                         [0.51]                         [0.02]                          [0.47] 
2012-10-28       [0.48, 0.46, 0.48, 0.49]       [0.04, 0.09, 0.03, 0.03]        [0.49, 0.46, 0.49, 0.48]  
2012-10-29                             []                             []                              []
2012-10-30                         [0.53]                         [0.02]                          [0.45]
2012-10-31                             []                             []                              []
2012-11-01             [0.51, 0.49, 0.47]             [0.04, 0.03, 0.05]              [0.45, 0.49, 0.47] 
2012-11-02                             []                             []                              []
2012-11-03                         [0.49]                         [0.06]                          [0.45]
2012-11-04  [0.53, 0.47, 0.49, 0.52, 0.5]  [0.0, 0.08, 0.03, 0.01, 0.03]  [0.39, 0.44, 0.48, 0.46, 0.47]
2012-11-05                         [0.51]                         [0.02]                          [0.46]
2012-11-06                             []                             []                              []
2012-11-07                         [0.51]                          [0.0]                          [0.41]

>>> df1.rolling('5D').count().equals(df1)
True

>>> df1.rolling('5D').sum().equals(df1)
True

>>> df1.rolling('5D').apply(lambda x: x+5).equals(df1)
True
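One likely explanation, offered as an assumption rather than a confirmed diagnosis, is that columns holding lists have dtype object, and rolling aggregations are defined for numeric data only. The sketch below uses a small hand-built Series (not the resample output above) to show the dtype, then flattens the lists with `explode` so that rolling has numbers to work with. Newer pandas releases may raise an error on object columns instead of returning them unchanged.

```python
import pandas as pd

# A small Series whose cells are lists, similar in shape to one column of df1.
index = pd.date_range('2012-10-25', periods=3, freq='D')
lists = pd.Series([[0.48, 0.51], [0.56], [0.51]], index=index)

# Cells that hold lists force the generic object dtype.
print(lists.dtype)

# Flatten back to one numeric value per row; each row keeps its original date.
flat = lists.explode().astype(float)

# Rolling now computes as usual on the numeric data.
counts = flat.rolling('5D').count()
print(counts)
```

Flattening brings back the duplicate timestamps, so this only shows why the list-valued frame is skipped; it does not by itself solve the same-day problem described in the next section.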

Problem description

To give more context, I wanted to find the five-day rolling average, but the rolling method does not include all the rows for the current date if there are multiple rows with the same date. See this output:

>>> df.rolling('5D').count()
            favorable  other  unfavorable
2012-10-25        1.0    1.0          1.0
2012-10-25        2.0    2.0          2.0
2012-10-26        3.0    3.0          3.0
2012-10-27        4.0    4.0          4.0
2012-10-28        5.0    5.0          5.0
2012-10-28        6.0    6.0          6.0
2012-10-28        7.0    7.0          7.0
2012-10-28        8.0    8.0          8.0
2012-10-30        7.0    7.0          7.0
2012-11-01        6.0    6.0          6.0
2012-11-01        7.0    7.0          7.0
2012-11-01        8.0    8.0          8.0
2012-11-03        5.0    5.0          5.0
2012-11-04        5.0    5.0          5.0
2012-11-04        6.0    6.0          6.0
2012-11-04        7.0    7.0          7.0
2012-11-04        8.0    8.0          8.0
2012-11-04        9.0    9.0          9.0
2012-11-05       10.0   10.0         10.0
2012-11-07        8.0    8.0          8.0

The first row for 10-25-2012 has a count of 1 and the second has a count of 2. You can see this pattern continue for all rows that have the same date. Because of this, I decided to group all the values of the same day into a list with resample and then use rolling on that frame to get the desired result. Strangely, the original DataFrame is being returned when there are lists as values.
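A possible workaround, sketched here under the assumption that the behaviour shown in the output above holds (the last row of each timestamp sees every row sharing that timestamp), is to run rolling on the original rows and keep only the last row per date. The sample uses the first nine 'favorable' values from the example.

```python
import pandas as pd

# First nine 'favorable' values from the example, with repeated dates.
index = pd.DatetimeIndex(['2012-10-25', '2012-10-25', '2012-10-26',
                          '2012-10-27', '2012-10-28', '2012-10-28',
                          '2012-10-28', '2012-10-28', '2012-10-30'])
s = pd.Series([0.48, 0.51, 0.56, 0.51, 0.48, 0.46, 0.48, 0.49, 0.53],
              index=index, name='favorable')

rolled_count = s.rolling('5D').count()
rolled_mean = s.rolling('5D').mean()

# Keep only the final row of each date: its window already contains
# all rows that share the same timestamp.
is_last = ~s.index.duplicated(keep='last')
count_per_day = rolled_count[is_last]
mean_per_day = rolled_mean[is_last]

print(count_per_day)
print(mean_per_day)
```

Because every row in the window contributes equally, the resulting mean is evenly weighted across observations rather than across days.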

Also, I think rolling has lots of room for improvement. I think it would be great to have the following:

- Include all rows that share the exact same timestamp.
- Allow the window size to be customized in either direction for any amount of time. For instance, it could find a rolling average over the 4 previous days along with the following 7 days.
- Make it equivalent to groupby and resample: have the same methods (there is no size method) and pass pandas objects to the agg/apply methods. Currently, it passes in numpy arrays.

I believe SAS has the capability to do the custom window size.
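Until rolling supports such windows directly, a two-sided window like the one requested above (4 previous days plus the following 7 days) can be approximated with plain label slicing on a sorted DatetimeIndex. This is a minimal sketch; `two_sided_mean` is a hypothetical helper name, not a pandas function, and the loop is not optimized for large frames.

```python
import pandas as pd

def two_sided_mean(frame, before, after):
    """Mean over [t - before, t + after] for each distinct timestamp t.

    Hypothetical helper: `frame` must have a sorted DatetimeIndex.
    Label slicing is inclusive on both ends and keeps duplicate dates.
    """
    rows = {}
    for t in frame.index.unique():
        window = frame.loc[t - before : t + after]
        rows[t] = window.mean()
    # Dict keys become columns, so transpose to get one row per timestamp.
    return pd.DataFrame(rows).T

index = pd.DatetimeIndex(['2012-10-25', '2012-10-25', '2012-10-26',
                          '2012-10-27', '2012-10-28', '2012-10-28',
                          '2012-10-28', '2012-10-28', '2012-10-30'])
df = pd.DataFrame(
    {'favorable': [0.48, 0.51, 0.56, 0.51, 0.48, 0.46, 0.48, 0.49, 0.53]},
    index=index)

result = two_sided_mean(df, pd.Timedelta(days=4), pd.Timedelta(days=7))
print(result)
```

Iterating over the unique timestamps also returns one row per date, so all rows with the same timestamp are included in each window.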

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.21.0
pytest: 3.2.1
pip: 9.0.1
setuptools: 36.5.0.post20170921
Cython: 0.26.1
numpy: 1.13.3
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 6.1.0
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.1.0
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: 1.2.0
xlsxwriter: 1.0.2
lxml: 4.1.0
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Issue Analytics

State:
Created 6 years ago
Comments: 13 (10 by maintainers)

Top GitHub Comments

1 reaction

letalvoj commented, Nov 29, 2017

I have a similar problem, but with sets:

import pandas as pd
pd.Series(data=[{1},{2},{3},{4}], index=[1,2,3,4]).rolling(2).apply(list)

# yields:
# 1    {1}
# 2    {2}
# 3    {3}
# 4    {4}
# dtype: object

# yet it should be something like:
# 1    None
# 2    [{1},{2}]
# 3    [{2},{3}]
# 4    [{3},{4}]
# dtype: object

@jreback if rolling over non-first-class elements is not supported, pandas should either issue a warning or fail with an error. IMO returning an unexpected, invalid result is not correct behaviour. I spent a lot of time trying to find out what I was doing wrong when the rolling functions did not return what I expected. Do you agree?
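For reference, the output this comment expects can be produced without rolling by slicing the values by position. This is only a sketch of the expected result, not how pandas implements rolling; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series(data=[{1}, {2}, {3}, {4}], index=[1, 2, 3, 4])
window = 2

values = s.tolist()
windows = [
    # Positions before the first full window get None.
    values[i - window + 1 : i + 1] if i >= window - 1 else None
    for i in range(len(values))
]
result = pd.Series(windows, index=s.index)
print(result)
```

Each cell of `result` is an ordinary Python list of the original set objects, so the Series has dtype object.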

1 reaction

tdpetrou commented, Nov 6, 2017

This wouldn’t work if you wanted an evenly-weighted mean, which is why I wanted to collect all the values together first in a list. You could do it in a roundabout way with df.resample('D').agg(['sum', 'count']) and then apply rolling to the result. However, you would not be able to use apply, because it only accepts a single column at a time as a numpy array (very strange). You would have to precalculate some total sums and weights. It would be a mess. df.rolling('5D').mean() would be way, way nicer.
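The roundabout route described in this comment can be sketched as follows for a single column: aggregate each day to a sum and a count, roll both over five days, and divide. The sample uses the first nine 'favorable' values from the issue; the variable names are illustrative.

```python
import pandas as pd

index = pd.DatetimeIndex(['2012-10-25', '2012-10-25', '2012-10-26',
                          '2012-10-27', '2012-10-28', '2012-10-28',
                          '2012-10-28', '2012-10-28', '2012-10-30'])
s = pd.Series([0.48, 0.51, 0.56, 0.51, 0.48, 0.46, 0.48, 0.49, 0.53],
              index=index, name='favorable')

# One row per calendar day; days without data get sum 0 and count 0.
daily = s.resample('D').agg(['sum', 'count'])

# Roll the daily totals, then divide to weight every observation equally.
totals = daily.rolling('5D').sum()
evenly_weighted = totals['sum'] / totals['count']
print(evenly_weighted)
```

A window that contains no observations at all would divide zero by zero and yield NaN, which seems a reasonable result for an empty window.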