ENH: Group By Grouping Sets / Cube / Rollup

Problem description

GROUPING SETS, CUBE, and ROLLUP are SQL tools for analyzing data across multiple dimensions efficiently, and pandas does not provide them out of the box. Some of this can be achieved with a pivot table plus the melt/stack functions, but as core analysis tools these functions should be built in, and they would also reduce the number of lines of code needed.

GROUP BY GROUPING SETS lets a query made of several GROUP BY clauses combined with UNION statements be rewritten as a single query. CUBE is shorthand for the grouping sets that cover every combination of the fields listed in the CUBE clause.
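The relationship between CUBE, ROLLUP, and plain grouping sets can be made concrete with a short sketch. The helpers `cube_sets` and `rollup_sets` are hypothetical names used only for illustration; they are not part of pandas. They simply enumerate the grouping sets each clause stands for in standard SQL: CUBE yields every subset of the listed columns, and ROLLUP yields every prefix.

```python
from itertools import combinations


def cube_sets(columns):
    """Return every subset of `columns`, largest first (what CUBE expands to)."""
    cols = list(columns)
    return [
        subset
        for size in range(len(cols), -1, -1)
        for subset in combinations(cols, size)
    ]


def rollup_sets(columns):
    """Return every prefix of `columns`, longest first (what ROLLUP expands to)."""
    cols = tuple(columns)
    return [cols[:n] for n in range(len(cols), -1, -1)]


print(cube_sets(["column1", "column2"]))
# [('column1', 'column2'), ('column1',), ('column2',), ()]
print(rollup_sets(["column1", "column2"]))
# [('column1', 'column2'), ('column1',), ()]
```

The empty tuple `()` stands for the grand total, i.e. an aggregate over all rows with no grouping columns.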

```sql
SELECT
    column1,
    column2,
    aggregate_function (column3)
FROM
    table_name
GROUP BY
    GROUPING SETS (
        (column1, column2),
        (column1),
        (column2),
        ()
);
```

```sql
SELECT
    column1,
    column2,
    column3,
    column4,
    aggregate_function (column5)
FROM
    table_name
GROUP BY
    column1, column2, CUBE (column3, column4);
```
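For comparison, the GROUPING SETS query can be emulated in pandas today by running one `groupby` per grouping set and concatenating the results. This is only a minimal sketch under stated assumptions: `grouping_sets` is a hypothetical helper (not a pandas API), the sample frame `sales` is made up for illustration, and columns that are not part of a given set are left as NaN to mirror the NULLs SQL produces.

```python
import pandas as pd


def grouping_sets(df, sets, value, aggfunc="sum"):
    """Aggregate `value` once per grouping set and stack the results.

    Hypothetical helper, not part of pandas. Columns missing from a given
    set come back as NaN, similar to the NULLs emitted by SQL.
    """
    # Preserve first-seen order of all key columns across the sets.
    all_keys = list(dict.fromkeys(col for keys in sets for col in keys))
    parts = []
    for keys in sets:
        keys = list(keys)
        if keys:
            part = df.groupby(keys, as_index=False)[value].agg(aggfunc)
        else:
            # The empty set () is the grand total over the whole frame.
            part = pd.DataFrame({value: [df[value].agg(aggfunc)]})
        parts.append(part)
    result = pd.concat(parts, ignore_index=True)
    return result.reindex(columns=all_keys + [value])


# Made-up sample data, for illustration only.
sales = pd.DataFrame(
    {
        "column1": ["a", "a", "b", "b"],
        "column2": ["x", "y", "x", "y"],
        "column3": [1, 2, 3, 4],
    }
)

result = grouping_sets(
    sales,
    [("column1", "column2"), ("column1",), ("column2",), ()],
    value="column3",
)
print(result)
```

Note that this approach scans the frame once per grouping set, which is the performance cost the issue wants to avoid, and that a NaN already present in a key column cannot be told apart from a rolled-up NaN (SQL offers `GROUPING()` for that distinction).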

Current way
```pseudo code
a = <pandas dataframe>
a1 = a.groupby([column1])[column5].sum()
a2 = a.groupby([column1, column2])[column5].sum()
...
an = a.groupby([column1, ..., columnn])[column5].sum()
result = pd.concat([a1, a2, ..., an])
```

Expected way

```pseudo code
a = <pandas dataframe>

# Proposed API: none of these methods exist in pandas today.
groupby_cube1 = a.groupby([column1, column2]).cube([column3, ..., columnn]).sum(column5)
groupby_cube2 = a.groupby.cube([column1, column2, ..., columnn]).sum(column5)

groupby_sets1 = a.groupby.sets({column1, column2}, {column1, column2, column3}, {}).sum(column5)
groupby_sets2 = a.groupby([column1, column2]).sets({column1, column2, column3}, {}).sum(column5)

groupby_rollup1 = a.groupby.rollup({column1, column2, column3}).sum(column5)
groupby_rollup2 = a.groupby([column1, column2]).rollup({column3}).sum(column5)
```

Issue Analytics

Created: 4 years ago
Reactions: 2
Comments: 10 (3 by maintainers)

Top GitHub Comments

5 reactions

rsdpyenugula commented, Nov 9, 2019

@jreback, I am not sure why it's complicated. It has already been implemented in open-source databases such as PostgreSQL, so a similar approach could be followed and there is no need to reinvent the wheel. As I said above, these functions are implemented in a way that helps performance compared with the concat approach.
pandas is known as an analytics tool, so I strongly believe these functions should be available out of the box. If this API were implemented, it would also help other libraries built on the pandas API, for example Dask. Every piece of software has bugs, and they get fixed in later iterations. In my example above, you can see two different API styles. The first is the existing pandas style. The second is a totally new API, so even if it has bugs it will not affect other APIs, and once it is fully implemented without bugs the existing API can be deprecated.

1 reaction

jreback commented, Feb 22, 2020

@rsdpyenugula if you really want to do this, then you should edit the top post with detailed examples (quality, not quantity, matters here), docstrings, and typed function signatures

a POC implementation PR would also be nice to have

pandas is all-volunteer, and folks have limited time with 3000+ issues, so contributions are the only thing that will move this forward