ENH: Group By Grouping Sets / Cube / Rollup
Problem description
Basically, these are SQL tools for analysing data across multiple dimensions in a single, performant query, and they are missing from pandas out of the box. Some of this can be achieved with a pivot table and the melt/stack functions, but as analysis tools these functions should be a must-have, and they would also reduce the number of lines of code.
GROUP BY GROUPING SETS lets a query made of multiple GROUP BY clauses combined with UNION statements be rewritten as a single query. CUBE is shorthand for GROUPING SETS when the user wants every combination of the fields listed in the CUBE clause.
```sql
SELECT
    column1,
    column2,
    aggregate_function(column3)
FROM
    table_name
GROUP BY
    GROUPING SETS (
        (column1, column2),
        (column1),
        (column2),
        ()
    );

SELECT
    column1,
    column2,
    column3,
    column4,
    aggregate_function(column5)
FROM
    table_name
GROUP BY
    column1, column2, CUBE (column3, column4);
```
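For reference, here is a minimal sketch of how the first query above could be emulated in current pandas. The helper name `grouping_sets` and the sample data are hypothetical and not part of pandas. It runs one `groupby` per grouping set and stacks the partial results with `pd.concat`, leaving `NaN` where SQL would return `NULL` for a rolled-up key.

```python
import pandas as pd


def grouping_sets(df, sets, value, aggfunc="sum"):
    """Hypothetical helper (not a pandas API): one groupby per grouping set."""
    # Every key that appears in any set, in first-seen order.
    all_keys = list(dict.fromkeys(key for keys in sets for key in keys))
    frames = []
    for keys in sets:
        keys = list(keys)
        if keys:
            part = df.groupby(keys, as_index=False)[value].agg(aggfunc)
        else:
            # The empty set () is the grand total over all rows.
            part = pd.DataFrame({value: [df[value].agg(aggfunc)]})
        frames.append(part)
    # Keys missing from a given set are filled with NaN by concat.
    stacked = pd.concat(frames, ignore_index=True)
    return stacked.reindex(columns=all_keys + [value])


# Made-up sample data for illustration only.
df = pd.DataFrame(
    {
        "column1": ["a", "a", "b", "b"],
        "column2": ["x", "y", "x", "y"],
        "column3": [1, 2, 3, 4],
    }
)

result = grouping_sets(
    df,
    sets=[("column1", "column2"), ("column1",), ("column2",), ()],
    value="column3",
)
print(result)
```

One limitation of this sketch: a `NaN` in a key column only marks a rolled-up level. By default `groupby` drops rows whose key is missing, so this approach cannot distinguish real missing keys from subtotal rows the way SQL's `GROUPING()` function does.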
Current way
```pseudo code
a = <pandas dataframe>
a1 = a.groupby([column1])[column5].sum()
a2 = a.groupby([column1, column2])[column5].sum()
...
an = a.groupby([column1, ..., columnn])[column5].sum()
result = pd.concat([a1, a2, ..., an])
```
Expected way
```pseudo code
a = <pandas dataframe>
groupby_cube1 = a.groupby([column1, column2]).cube([column3, ..., columnn]).sum(column5)
groupby_cube2 = a.groupby.cube([column1, column2, ..., columnn]).sum(column5)
groupby_sets1 = a.groupby.sets({column1, column2}, {column1, column2, column3}, {}).sum(column5)
groupby_sets2 = a.groupby([column1, column2]).sets({column1, column2, column3}, {}).sum(column5)
groupby_rollup1 = a.groupby.rollup({column1, column2, column3}).sum(column5)
groupby_rollup2 = a.groupby([column1, column2]).rollup({column3}).sum(column5)
```
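The cube and rollup forms proposed above both reduce to a list of grouping sets, as they do in SQL. Here is a minimal sketch of that expansion in plain Python. The function names `cube_sets` and `rollup_sets` and the `prefix` parameter are hypothetical illustrations, not an existing or agreed pandas API.

```python
from itertools import combinations


def cube_sets(columns, prefix=()):
    """All subsets of `columns`, largest first, each preceded by `prefix`.

    `prefix` holds the keys that are always grouped on, as in
    GROUP BY column1, column2, CUBE (column3, column4).
    """
    columns = list(columns)
    return [
        tuple(prefix) + combo
        for size in range(len(columns), -1, -1)
        for combo in combinations(columns, size)
    ]


def rollup_sets(columns, prefix=()):
    """Leading slices of `columns`, longest first, down to the empty set."""
    columns = list(columns)
    return [
        tuple(prefix) + tuple(columns[:n])
        for n in range(len(columns), -1, -1)
    ]


cube = cube_sets(["column3", "column4"], prefix=["column1", "column2"])
rollup = rollup_sets(["column1", "column2", "column3"])

for keys in cube:
    print("cube  ", keys)
for keys in rollup:
    print("rollup", keys)
```

A cube over n columns expands to 2^n grouping sets and a rollup to n + 1. This is one reason a built-in implementation might avoid running a separate `groupby` per set.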
Issue Analytics
- State:
- Created 4 years ago
- Reactions:2
- Comments:10 (3 by maintainers)
@jreback, I'm not sure why it's complicated. It has already been implemented in open-source databases such as PostgreSQL, so a similar approach could be followed; there is no need to reinvent the wheel. As I said above, these functions are implemented in a way that helps performance, compared with the concat approach.
pandas is known as an analytics tool, so I strongly feel these functions should be available out of the box. If this API were implemented, it would also help other libraries built on the pandas API, for example Dask. Every piece of software has bugs, and they get fixed in later iterations. In my example above you can see two different API styles. The first is the existing pandas style. The second is a totally new API, so even if it has bugs it will not affect other APIs, and once it is fully implemented without bugs, the other existing API can be deprecated.
@rsdpyenugula if you really want to do this, then you should edit the top post with detailed examples (quality, not quantity, matters here), doc-strings, and typed function signatures.
A POC implementation PR would also be nice to have.
pandas is all volunteer, and folks have limited time with 3000+ issues, so contributions are the only thing that will move this forward.