
DataFrame.rolling does nothing when values are in a list


This issue is based on data from this Stack Overflow post.

First, I collect all the values for each day into a list using resample. I then try to apply a five-day rolling function, but the original DataFrame is returned unchanged. No calculation happens.

>>> import pandas as pd
>>> d = {'favorable': [0.48, 0.51, 0.56, 0.51, 0.48, 0.46, 0.48, 0.49, 0.53, 0.51, 0.49, 0.47, 0.49, 0.53, 0.47, 0.49, 0.52, 0.5, 0.51, 0.51],
         'unfavorable': [0.49, 0.48, 0.4, 0.47, 0.49, 0.46, 0.49, 0.48, 0.45, 0.45, 0.49, 0.47, 0.45, 0.39, 0.44, 0.48, 0.46, 0.47, 0.46, 0.41], 
         'other': [0.03, 0.02, 0.04, 0.02, 0.04, 0.09, 0.03, 0.03, 0.02, 0.04, 0.03, 0.05, 0.06, 0.0, 0.08, 0.03, 0.01, 0.03, 0.02, 0.0]}
>>> index = pd.DatetimeIndex(['2012-10-25', '2012-10-25', '2012-10-26', '2012-10-27', '2012-10-28', '2012-10-28', '2012-10-28', '2012-10-28', '2012-10-30', '2012-11-01', '2012-11-01', '2012-11-01', '2012-11-03', '2012-11-04', '2012-11-04', '2012-11-04', '2012-11-04', '2012-11-04', '2012-11-05', '2012-11-07'])
>>> df = pd.DataFrame(d, index=index)
>>> df

            favorable  other  unfavorable
2012-10-25       0.48   0.03         0.49
2012-10-25       0.51   0.02         0.48
2012-10-26       0.56   0.04         0.40
2012-10-27       0.51   0.02         0.47
2012-10-28       0.48   0.04         0.49
2012-10-28       0.46   0.09         0.46
2012-10-28       0.48   0.03         0.49
2012-10-28       0.49   0.03         0.48
2012-10-30       0.53   0.02         0.45
2012-11-01       0.51   0.04         0.45
2012-11-01       0.49   0.03         0.49
2012-11-01       0.47   0.05         0.47
2012-11-03       0.49   0.06         0.45
2012-11-04       0.53   0.00         0.39
2012-11-04       0.47   0.08         0.44
2012-11-04       0.49   0.03         0.48
2012-11-04       0.52   0.01         0.46
2012-11-04       0.50   0.03         0.47
2012-11-05       0.51   0.02         0.46
2012-11-07       0.51   0.00         0.41

>>> df1 = df.resample('D').apply(lambda x: x.tolist())
>>> df1

                                favorable                          other                     unfavorable
2012-10-25                   [0.48, 0.51]                   [0.03, 0.02]                    [0.49, 0.48]
2012-10-26                         [0.56]                         [0.04]                           [0.4]
2012-10-27                         [0.51]                         [0.02]                          [0.47] 
2012-10-28       [0.48, 0.46, 0.48, 0.49]       [0.04, 0.09, 0.03, 0.03]        [0.49, 0.46, 0.49, 0.48]  
2012-10-29                             []                             []                              []
2012-10-30                         [0.53]                         [0.02]                          [0.45]
2012-10-31                             []                             []                              []
2012-11-01             [0.51, 0.49, 0.47]             [0.04, 0.03, 0.05]              [0.45, 0.49, 0.47] 
2012-11-02                             []                             []                              []
2012-11-03                         [0.49]                         [0.06]                          [0.45]
2012-11-04  [0.53, 0.47, 0.49, 0.52, 0.5]  [0.0, 0.08, 0.03, 0.01, 0.03]  [0.39, 0.44, 0.48, 0.46, 0.47]
2012-11-05                         [0.51]                         [0.02]                          [0.46]
2012-11-06                             []                             []                              []
2012-11-07                         [0.51]                          [0.0]                          [0.41]

>>> df1.rolling('5D').count().equals(df1)
True

>>> df1.rolling('5D').sum().equals(df1)
True

>>> df1.rolling('5D').apply(lambda x: x+5).equals(df1)
True

Problem description

To give more context, I wanted to find the five-day rolling average, but the rolling method does not include all the rows for the current date if there are multiple rows with the same date. See this output:

>>> df.rolling('5D').count()
            favorable  other  unfavorable
2012-10-25        1.0    1.0          1.0
2012-10-25        2.0    2.0          2.0
2012-10-26        3.0    3.0          3.0
2012-10-27        4.0    4.0          4.0
2012-10-28        5.0    5.0          5.0
2012-10-28        6.0    6.0          6.0
2012-10-28        7.0    7.0          7.0
2012-10-28        8.0    8.0          8.0
2012-10-30        7.0    7.0          7.0
2012-11-01        6.0    6.0          6.0
2012-11-01        7.0    7.0          7.0
2012-11-01        8.0    8.0          8.0
2012-11-03        5.0    5.0          5.0
2012-11-04        5.0    5.0          5.0
2012-11-04        6.0    6.0          6.0
2012-11-04        7.0    7.0          7.0
2012-11-04        8.0    8.0          8.0
2012-11-04        9.0    9.0          9.0
2012-11-05       10.0   10.0         10.0
2012-11-07        8.0    8.0          8.0

The first row for 10-25-2012 has a count of 1 and the second has a count of 2. You can see this pattern continue for all rows that have the same date. Because of this, I decided to group all the values of the same day into a list with resample and then use rolling on that frame to get the desired result. Strangely, the original DataFrame is being returned when there are lists as values.
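
One possible workaround, sketched here on a reduced version of the data, is to run the rolling calculation on the original frame and then keep only the last row for each timestamp. It relies on the behaviour visible in the output above, where the last row in a run of duplicate dates already includes every row for that date. The sample values and the names df_small, inclusive_count and inclusive_mean are illustrative, not part of the original report.

```python
import pandas as pd

# Reduced sample: two rows share the same date.
index = pd.DatetimeIndex(['2012-10-25', '2012-10-25', '2012-10-26'])
df_small = pd.DataFrame({'favorable': [0.48, 0.51, 0.56]}, index=index)

rolled = df_small.rolling('5D')

# For each date, the last duplicate row has seen every row for that date,
# so keeping it gives a window that includes all same-day rows.
inclusive_count = rolled.count().groupby(level=0).last()
inclusive_mean = rolled.mean().groupby(level=0).last()

print(inclusive_count)
print(inclusive_mean)
```

This yields one row per distinct date, and the mean weights every original row in the window equally.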

Also, I think rolling has lots of room for improvement. I think it would be great to have the following:

  • Inclusive of all rows with the same exact timestamp
  • Customize window size in either direction for any amount of time. For instance, it could find a rolling average for the 4 previous days along with the following 7 days
  • Make it equivalent to groupby and resample - have same methods (there is no size method) and have it pass in pandas objects to the agg/apply methods. Currently, it passes in numpy arrays.

I believe SAS has the capability to do the custom window size.
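
Until rolling supports something like this, a window that extends in both directions can be approximated by slicing the frame by label for each distinct day. The following is a minimal sketch, not a pandas feature: the helper name two_sided_window_mean and its before/after parameters are hypothetical, and it assumes a sorted DatetimeIndex whose timestamps fall at midnight.

```python
import pandas as pd

def two_sided_window_mean(frame, before, after):
    """Mean of all rows from `before` prior to each day through `after` following it.

    Hypothetical helper: assumes `frame` has a sorted DatetimeIndex.
    Label slicing with .loc includes both endpoints and every duplicate timestamp.
    """
    days = frame.index.normalize().unique()
    means = {day: frame.loc[day - before: day + after].mean() for day in days}
    # One column per day after concat, so transpose to get one row per day.
    return pd.concat(means, axis=1).T

index = pd.DatetimeIndex(['2012-10-25', '2012-10-27', '2012-11-05'])
sample = pd.DataFrame({'favorable': [0.4, 0.6, 1.0]}, index=index)

# 4 previous days plus the following 7 days, as in the example above.
result = two_sided_window_mean(sample, pd.Timedelta(days=4), pd.Timedelta(days=7))
print(result)
```

The loop is straightforward rather than fast; for large frames a vectorised approach would be preferable.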

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.21.0
pytest: 3.2.1
pip: 9.0.1
setuptools: 36.5.0.post20170921
Cython: 0.26.1
numpy: 1.13.3
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 6.1.0
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.1.0
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: 1.2.0
xlsxwriter: 1.0.2
lxml: 4.1.0
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Issue Analytics

  • State: open
  • Created 6 years ago
  • Comments: 13 (10 by maintainers)

Top GitHub Comments

letalvoj commented, Nov 29, 2017 (1 reaction)

I have a similar problem, but with sets:

import pandas as pd
pd.Series(data=[{1},{2},{3},{4}], index=[1,2,3,4]).rolling(2).apply(list)

# yields:
# 1    {1}
# 2    {2}
# 3    {3}
# 4    {4}
# dtype: object

# yet it should be something like:
# 1    None
# 2    [{1},{2}]
# 3    [{2},{3}]
# 4    [{3},{4}]
# dtype: object

@jreback if rolling over non-first-class elements is not supported, pandas should either emit a warning or fail with an error. IMO, returning an unexpected, invalid result is not correct behaviour. I spent a lot of time trying to find out what I was doing wrong and why the rolling functions did not return what I expected. Do you agree?
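
For reference, the expected result in the comment above can be produced without rolling by slicing the Series by position. This is only a sketch of a manual workaround with illustrative variable names; it does not change how rolling treats object values.

```python
import pandas as pd

s = pd.Series(data=[{1}, {2}, {3}, {4}], index=[1, 2, 3, 4])
window = 2

# Build each window by position; positions without a full window get None.
values = [
    None if pos + 1 < window else list(s.iloc[pos + 1 - window: pos + 1])
    for pos in range(len(s))
]
windows = pd.Series(values, index=s.index, dtype=object)
print(windows)
```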

tdpetrou commented, Nov 6, 2017 (1 reaction)

This wouldn’t work if you wanted an evenly-weighted mean, which is why I wanted to collect all the values together first in a list. You could do it in a roundabout way with df.resample('D').agg(['sum', 'count']) and then do it with rolling. Although, you would not be able to use apply because it only accepts a single column at a time as a numpy array (very strange). You would have to precalculate some total sums and weights. It would be a mess. df.rolling('5D').mean() would be way, way nicer.
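
The roundabout approach mentioned in this comment might look like the sketch below, shown on a reduced sample with illustrative variable names. It aggregates a daily sum and count, rolls both over five days, and divides, so every original row carries equal weight. It assumes days with no rows contribute nothing to either total.

```python
import pandas as pd

index = pd.DatetimeIndex(['2012-10-25', '2012-10-25', '2012-10-26', '2012-10-28'])
sample = pd.DataFrame({'favorable': [0.48, 0.51, 0.56, 0.51]}, index=index)

# Daily totals and row counts; columns become a (column, statistic) MultiIndex.
daily = sample.resample('D').agg(['sum', 'count'])

# Roll both statistics over five days, then divide sums by counts.
rolled = daily.rolling('5D').sum()
sums = rolled.xs('sum', axis=1, level=1)
counts = rolled.xs('count', axis=1, level=1)
evenly_weighted_mean = sums / counts
print(evenly_weighted_mean)
```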
