DataFrame.rolling does nothing when values are in a list
This issue is based on data from this Stack Overflow post.
First, I collect all the values for each day into a list using resample. I then try to apply a five-day rolling function, and the original DataFrame is returned. No calculation happens.
>>> import pandas as pd
>>> d = {'favorable': [0.48, 0.51, 0.56, 0.51, 0.48, 0.46, 0.48, 0.49, 0.53, 0.51, 0.49, 0.47, 0.49, 0.53, 0.47, 0.49, 0.52, 0.5, 0.51, 0.51],
'unfavorable': [0.49, 0.48, 0.4, 0.47, 0.49, 0.46, 0.49, 0.48, 0.45, 0.45, 0.49, 0.47, 0.45, 0.39, 0.44, 0.48, 0.46, 0.47, 0.46, 0.41],
'other': [0.03, 0.02, 0.04, 0.02, 0.04, 0.09, 0.03, 0.03, 0.02, 0.04, 0.03, 0.05, 0.06, 0.0, 0.08, 0.03, 0.01, 0.03, 0.02, 0.0]}
>>> index = pd.DatetimeIndex(['2012-10-25', '2012-10-25', '2012-10-26', '2012-10-27', '2012-10-28', '2012-10-28', '2012-10-28', '2012-10-28', '2012-10-30', '2012-11-01', '2012-11-01', '2012-11-01', '2012-11-03', '2012-11-04', '2012-11-04', '2012-11-04', '2012-11-04', '2012-11-04', '2012-11-05', '2012-11-07'])
>>> df = pd.DataFrame(d, index=index)
>>> df
favorable other unfavorable
2012-10-25 0.48 0.03 0.49
2012-10-25 0.51 0.02 0.48
2012-10-26 0.56 0.04 0.40
2012-10-27 0.51 0.02 0.47
2012-10-28 0.48 0.04 0.49
2012-10-28 0.46 0.09 0.46
2012-10-28 0.48 0.03 0.49
2012-10-28 0.49 0.03 0.48
2012-10-30 0.53 0.02 0.45
2012-11-01 0.51 0.04 0.45
2012-11-01 0.49 0.03 0.49
2012-11-01 0.47 0.05 0.47
2012-11-03 0.49 0.06 0.45
2012-11-04 0.53 0.00 0.39
2012-11-04 0.47 0.08 0.44
2012-11-04 0.49 0.03 0.48
2012-11-04 0.52 0.01 0.46
2012-11-04 0.50 0.03 0.47
2012-11-05 0.51 0.02 0.46
2012-11-07 0.51 0.00 0.41
>>> df1 = df.resample('D').apply(lambda x: x.tolist())
>>> df1
favorable other unfavorable
2012-10-25 [0.48, 0.51] [0.03, 0.02] [0.49, 0.48]
2012-10-26 [0.56] [0.04] [0.4]
2012-10-27 [0.51] [0.02] [0.47]
2012-10-28 [0.48, 0.46, 0.48, 0.49] [0.04, 0.09, 0.03, 0.03] [0.49, 0.46, 0.49, 0.48]
2012-10-29 [] [] []
2012-10-30 [0.53] [0.02] [0.45]
2012-10-31 [] [] []
2012-11-01 [0.51, 0.49, 0.47] [0.04, 0.03, 0.05] [0.45, 0.49, 0.47]
2012-11-02 [] [] []
2012-11-03 [0.49] [0.06] [0.45]
2012-11-04 [0.53, 0.47, 0.49, 0.52, 0.5] [0.0, 0.08, 0.03, 0.01, 0.03] [0.39, 0.44, 0.48, 0.46, 0.47]
2012-11-05 [0.51] [0.02] [0.46]
2012-11-06 [] [] []
2012-11-07 [0.51] [0.0] [0.41]
>>> df1.rolling('5D').count().equals(df1)
True
>>> df1.rolling('5D').sum().equals(df1)
True
>>> df1.rolling('5D').apply(lambda x: x+5).equals(df1)
True
Problem description
To give more context, I wanted to find the five-day rolling average, but the rolling method does not include all the rows for the current date if there are multiple rows with the same date. See this output:
>>> df.rolling('5D').count()
favorable other unfavorable
2012-10-25 1.0 1.0 1.0
2012-10-25 2.0 2.0 2.0
2012-10-26 3.0 3.0 3.0
2012-10-27 4.0 4.0 4.0
2012-10-28 5.0 5.0 5.0
2012-10-28 6.0 6.0 6.0
2012-10-28 7.0 7.0 7.0
2012-10-28 8.0 8.0 8.0
2012-10-30 7.0 7.0 7.0
2012-11-01 6.0 6.0 6.0
2012-11-01 7.0 7.0 7.0
2012-11-01 8.0 8.0 8.0
2012-11-03 5.0 5.0 5.0
2012-11-04 5.0 5.0 5.0
2012-11-04 6.0 6.0 6.0
2012-11-04 7.0 7.0 7.0
2012-11-04 8.0 8.0 8.0
2012-11-04 9.0 9.0 9.0
2012-11-05 10.0 10.0 10.0
2012-11-07 8.0 8.0 8.0
The first row for 10-25-2012 has a count of 1 and the second has a count of 2. You can see this pattern continue for all rows that have the same date. Because of this, I decided to group all the values of the same day into a list with resample and then use rolling on that frame to get the desired result. Strangely, the original DataFrame is returned when the values are lists.
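One way to make every row with a duplicated timestamp see the full window, without building lists, is to broadcast the last rolling result of each timestamp back to all rows that share it. This is only a sketch on a three-row subset of the data above, not a built-in rolling option; the names `rolled` and `inclusive` are illustrative.

```python
import pandas as pd

# Three-row subset of the data above: two rows share 2012-10-25.
index = pd.DatetimeIndex(['2012-10-25', '2012-10-25', '2012-10-26'])
df = pd.DataFrame({'favorable': [0.48, 0.51, 0.56]}, index=index)

# Plain time-based rolling counts rows one at a time: 1, 2, 3.
rolled = df.rolling('5D').count()

# The last row of each timestamp has seen every row with that timestamp,
# so copy its result to all rows sharing the timestamp: 2, 2, 3.
inclusive = rolled.groupby(level=0).transform('last')
```

The same pattern should apply to other aggregations such as `mean()`, since the last duplicate's window always contains every row with that timestamp.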
Also, I think rolling has lots of room for improvement. I think it would be great to have the following:
- Inclusive of all rows with the same exact timestamp
- Customize window size in either direction for any amount of time. For instance, it could find a rolling average for the 4 previous days along with the following 7 days
- Make it equivalent to groupby and resample: have the same methods (there is no size method) and have it pass pandas objects to the agg/apply methods. Currently, it passes in numpy arrays.
I believe SAS has the capability to do the custom window size.
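To make the custom-window request concrete, here is a rough sketch of a window that looks both backward and forward in time, built from boolean masks on the index. The helper `asymmetric_window_mean` and its parameters are hypothetical and not part of pandas. It loops over unique timestamps, so it is only meant for small frames.

```python
import pandas as pd

def asymmetric_window_mean(frame, before, after):
    """Hypothetical helper: mean of all rows whose timestamp lies in
    [t - before, t + after], for each unique timestamp t in the index."""
    idx = frame.index
    rows = {}
    for t in idx.unique():
        mask = (idx >= t - before) & (idx <= t + after)
        rows[t] = frame[mask].mean()
    # One row per unique timestamp, one column per original column.
    return pd.DataFrame(rows).T

# Made-up illustrative values, not taken from the data above.
index = pd.DatetimeIndex(['2012-10-25', '2012-10-26', '2012-11-01'])
sample = pd.DataFrame({'favorable': [0.4, 0.6, 0.8]}, index=index)

# 4 previous days along with the following 7 days.
result = asymmetric_window_mean(sample, pd.Timedelta(days=4), pd.Timedelta(days=7))
```

Both window ends are inclusive in this sketch, and rows with the same timestamp are always included together.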
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None python: 3.6.3.final.0 python-bits: 64 OS: Darwin OS-release: 15.6.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8
pandas: 0.21.0 pytest: 3.2.1 pip: 9.0.1 setuptools: 36.5.0.post20170921 Cython: 0.26.1 numpy: 1.13.3 scipy: 0.19.1 pyarrow: None xarray: None IPython: 6.1.0 sphinx: 1.6.3 patsy: 0.4.1 dateutil: 2.6.1 pytz: 2017.2 blosc: None bottleneck: 1.2.1 tables: 3.4.2 numexpr: 2.6.2 feather: None matplotlib: 2.1.0 openpyxl: 2.4.8 xlrd: 1.1.0 xlwt: 1.2.0 xlsxwriter: 1.0.2 lxml: 4.1.0 bs4: 4.6.0 html5lib: 0.999999999 sqlalchemy: 1.1.13 pymysql: None psycopg2: None jinja2: 2.9.6 s3fs: None fastparquet: None pandas_gbq: None pandas_datareader: None
Top GitHub Comments
I have a similar problem, but with sets:

@jreback if rolling over non-first-class elements is not supported, pandas should either emit a warning or fail with an error. IMO returning an unexpected, invalid result is not correct behaviour. I spent a lot of time trying to find out what I was doing wrong when the rolling functions did not return what I expected. Do you agree?

This wouldn't work if you wanted an evenly-weighted mean, which is why I wanted to collect all the values together first in a list. You could do it in a roundabout way with df.resample('D').agg(['sum', 'count']) and then do it with rolling. However, you would not be able to use apply, because it only accepts a single column at a time as a numpy array (very strange). You would have to precalculate some total sums and weights. It would be a mess. df.rolling('5D').mean() would be way, way nicer.
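The roundabout route described in this comment could look roughly like the sketch below. It is a variant that calls sum and count separately instead of agg(['sum', 'count']), purely to avoid MultiIndex columns, and it uses only the first eight rows of the issue's data. The variable names are illustrative.

```python
import pandas as pd

# First eight rows of the issue's data (two of the three columns).
index = pd.DatetimeIndex(['2012-10-25', '2012-10-25', '2012-10-26', '2012-10-27',
                          '2012-10-28', '2012-10-28', '2012-10-28', '2012-10-28'])
df = pd.DataFrame({'favorable': [0.48, 0.51, 0.56, 0.51, 0.48, 0.46, 0.48, 0.49],
                   'unfavorable': [0.49, 0.48, 0.40, 0.47, 0.49, 0.46, 0.49, 0.48]},
                  index=index)

# Per-day totals and observation counts (days without rows get 0 and 0).
daily_sum = df.resample('D').sum()
daily_count = df.resample('D').count()

# Roll both over five days, then divide: every observation in the
# window carries equal weight, including all rows of the current day.
rolling_mean = daily_sum.rolling('5D').sum() / daily_count.rolling('5D').sum()
```

The result has one row per calendar day rather than one per original row. A window containing no observations at all would give 0/0, i.e. NaN.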