Rolling groupby should not maintain the by column in the resulting DataFrame
I found another oddity while digging through #13966.
Begin with the initial DataFrame in that issue:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.arange(40)})
Save the grouping:
In [215]: g = df.groupby('A')
Compute the rolling sum:
In [216]: r = g.rolling(4)
In [217]: r.sum()
Out[217]:
         A      B
A
1 0    NaN    NaN
  1    NaN    NaN
  2    NaN    NaN
  3    4.0    6.0
  4    4.0   10.0
  5    4.0   14.0
  6    4.0   18.0
  7    4.0   22.0
  8    4.0   26.0
  9    4.0   30.0
...    ...    ...
2 30   8.0  114.0
  31   8.0  118.0
3 32   NaN    NaN
  33   NaN    NaN
  34   NaN    NaN
  35  12.0  134.0
  36  12.0  138.0
  37  12.0  142.0
  38  12.0  146.0
  39  12.0  150.0
[40 rows x 2 columns]
It maintains the by column (A)! That column should not be in the resulting DataFrame.
It gets weirder if I compute the sum over the entire grouping and then redo the rolling calculation. Now the by column is gone, as expected:
In [218]: g.sum()
Out[218]:
     B
A
1  190
2  306
3  284
In [219]: r.sum()
Out[219]:
          B
A
1 0     NaN
  1     NaN
  2     NaN
  3     6.0
  4    10.0
  5    14.0
  6    18.0
  7    22.0
  8    26.0
  9    30.0
...     ...
2 30  114.0
  31  118.0
3 32    NaN
  33    NaN
  34    NaN
  35  134.0
  36  138.0
  37  142.0
  38  146.0
  39  150.0
[40 rows x 1 columns]
So the grouping summation has some sort of side effect.
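To check whether a given pandas install shows this side effect, something like the sketch below could be used. It reuses the same rolling object before and after calling g.sum() and compares the result columns. The names before and after are only illustrative, and the printed outcome depends on the pandas version, so no expected output is given.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.arange(40)})

g = df.groupby('A')
r = g.rolling(4)

# Columns of the rolling sum before any aggregation on the groupby object.
before = list(r.sum().columns)

# Aggregate over the whole grouping; the result is discarded on purpose.
g.sum()

# Columns of the rolling sum from the same rolling object afterwards.
after = list(r.sum().columns)

# If the two lists differ, g.sum() changed state that r depends on.
print(before, after, before == after)
```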
Issue Analytics
- State:
- Created 7 years ago
- Comments:10 (6 by maintainers)
The problem still exists in v1.0.1
Still the same problem in 0.25.
Workaround: df.groupby('A').rolling(4).sum().reset_index(level=0, drop=True)
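Another option, not taken from the thread, is to select the value column explicitly before rolling, so the by column is never part of the windowed computation. A minimal sketch using the DataFrame from the report (rolled and aligned are illustrative names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.arange(40)})

# Select the value column(s) up front; the result is indexed by
# (A, original index) and contains only column B.
rolled = df.groupby('A')[['B']].rolling(4).sum()

# Optionally drop the group level of the index, as in the workaround above,
# so the result lines up with the original row labels.
aligned = rolled.reset_index(level=0, drop=True)

print(rolled.head())
print(aligned.head())
```

The double brackets keep the result a DataFrame; df.groupby('A')['B'] would give a Series with the same MultiIndex instead.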