Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Deprecation of relabeling dicts in groupby.agg brings many issues

See original GitHub issue

This issue is created based on the discussion from #15931 following the deprecation of relabeling dicts in groupby.agg. A lot of what is summarized below was already discussed in the previous discussion. I would recommend in particular https://github.com/pandas-dev/pandas/pull/15931#issuecomment-336139085 where the problems are also clearly stated.

The motivation behind the deprecation of #15931 was mostly related to bringing a consistent interface for agg() between Series and Dataframe (see also #14668 for context).

The relabeling functionality with a nested dict has been described by some as being too complex and/or inconsistent and thus deprecated.

However, this comes at a price: the impossibility to aggregate and rename at the same time leads to very annoying issues and some backward incompatibility where no sensible workaround is available:

[annoying] no more control over the names of the resulting columns
[annoying] you need to find a way to rename the MultiIndex after performing the aggregation, requiring to keep track of the order of columns at two places in the code… not practical at all and sometimes downright impossible (cases below).
⚠️ [breaking] cannot apply more than one callable with the same internal name on the same input column. This results in two sub-cases:
- [breaking] you can’t apply anymore two or more lambda aggregators on the same column
- [breaking] you can’t apply anymore two or more aggregators from partial functions unless you alter their hidden __name__ attribute

Example

(please note, this is a crafted example for the purpose of demonstrating the problem in as short a code as possible, but all of the demonstrated issues here did bite me in real life since the change, and in situations not as simple as here)

Input Dataframe

mydf = pd.DataFrame(
    {
        'cat': ['A', 'A', 'A', 'B', 'B', 'C'],
        'energy': [1.8, 1.95, 2.04, 1.25, 1.6, 1.01],
        'distance': [1.2, 1.5, 1.74, 0.82, 1.01, 0.6]
    },
    index=range(6)
)

  cat  distance  energy
0   A      1.20    1.80
1   A      1.50    1.95
2   A      1.74    2.04
3   B      0.82    1.25
4   B      1.01    1.60
5   C      0.60    1.01

Before:

easy to write and read, and works as expected

import numpy as np
import statsmodels.robust as smrb
from functools import partial
 
# median absolute deviation as a partial function
# in order to demonstrate the issue with partial functions as aggregators
mad_c1 = partial(smrb.mad, c=1)

# renaming and specifying the aggregators at the same time
# note that I want to choose the resulting column names myself
# for example "total_xxxx" instead of just "sum"
mydf_agg = mydf.groupby('cat').agg({
    'energy': {
        'total_energy': 'sum',
        'energy_p98': lambda x: np.percentile(x, 98),  # lambda
        'energy_p17': lambda x: np.percentile(x, 17),  # lambda
    },
    'distance': {
        'total_distance': 'sum',
        'average_distance': 'mean',
        'distance_mad': smrb.mad,   # original function
        'distance_mad_c1': mad_c1,  # partial function wrapping the original function
    },
})

results in

          energy                             distance
    total_energy energy_p98 energy_p17 total_distance average_distance distance_mad distance_mad_c1
cat
A           5.79     2.0364     1.8510           4.44            1.480     0.355825           0.240
B           2.85     1.5930     1.3095           1.83            0.915     0.140847           0.095
C           1.01     1.0100     1.0100           0.60            0.600     0.000000           0.000

and all is left is:

# get rid of the first MultiIndex level in a pretty straightforward way
mydf_agg.columns = mydf_agg.columns.droplevel(level=0)

Happy dance praising pandas 💃 🕺 !

After

import numpy as np
import statsmodels.robust as smrb
from functools import partial

# median absolute deviation as a partial function
# in order to demonstrate the issue with partial functions as aggregators
mad_c1 = partial(smrb.mad, c=1)

# no way of choosing the destination's column names...
mydf_agg = mydf.groupby('cat').agg({
    'energy': [
    	'sum',
    	lambda x: np.percentile(x, 98), # lambda
    	lambda x: np.percentile(x, 17), # lambda
    ],
    'distance': [
    	'sum',
    	'mean',
    	smrb.mad, # original function
    	mad_c1,   # partial function wrapping the original function
    ],
})

The above breaks because the lambda functions will all result in columns named <lambda> which results in

SpecificationError: Function names must be unique, found multiple named <lambda>

Backward incompatible regression: one cannot apply two different lambdas to the same original column anymore.

If one removes the lambda x: np.percentile(x, 98) from above, we get the same issue with the partial function which inherits the function name from the original function:

SpecificationError: Function names must be unique, found multiple named mad

Finally, after overwriting the __name__ attribute of the partial (for example with mad_c1.__name__ = 'mad_c1') we get:

    energy          distance
       sum <lambda>      sum   mean       mad mad_c1
cat
A     5.79   1.8510     4.44  1.480  0.355825  0.240
B     2.85   1.3095     1.83  0.915  0.140847  0.095
C     1.01   1.0100     0.60  0.600  0.000000  0.000

with still

one column missing (98th percentile)
the handling of the MultiIndex columns
and the renaming of the columns

to deal with in separate step.

There is no control possible for the column names after aggregation, the best we can get in an automated way is some combination of original column name and the aggregate function’s name like this:

mydf_agg.columns = ['_'.join(col) for col in mydf_agg.columns]

which results in:

     energy_sum  energy_<lambda>  distance_sum  distance_mean  distance_mad distance_mad_c1
cat
A          5.79           1.8510          4.44          1.480      0.355825           0.240
B          2.85           1.3095          1.83          0.915      0.140847           0.095
C          1.01           1.0100          0.60          0.600      0.000000           0.000

and if you really need to have different names, you can do it like this:

mydf_agg.rename({
    "energy_sum": "total_energy",
    "energy_<lambda>": "energy_p17",
    "distance_sum": "total_distance",
    "distance_mean": "average_distance"
    }, inplace=True)

but that means that you need to be careful to keep the renaming code (which must now be located at another place in the code) in sync with the code where the aggregation is defined…

Sad pandas user 😢 (which still loves pandas of course)

I am all in for consistency, and at the same time I deeply regret the deprecation of the aggregate and rename functionality. I hope the examples above make the pain points clear.

Possible solutions

Un-deprecate the dict-of-dict relabeling functionality
Provide another API to be able to do it (but why should there be two methods for the same main purpose, namely aggregation?)
??? (open to suggestions)

Optional read:

With respect to the aforementioned discussion in the pull request which has been going on already for a few months, I only recently realized one of the reasons why I am so bothered by this deprecation: “aggregate and rename” is a natural thing to do with GROUP BY aggregations in SQL since in SQL you usually provide the destination column name directly next to the aggregation expression, e.g. SELECT col1, avg(col2) AS col2_mean, stddev(col2) AS col2_var FROM mytable GROUP BY col1.

I’m not saying that Pandas should necessarily provide the same functionalities as SQL of course. But the examples provided above demonstrate why the dict-of-dict API was in my opinion a clean and simple solution to many use-cases.

(* I don’t personally agree that the dict-of-dict approach is complex.)

Issue Analytics

State:
Created 6 years ago
Reactions:56
Comments:37 (31 by maintainers)

Top GitHub Comments

15reactions

tdpetroucommented, Nov 23, 2017

I agree that renaming with the current agg implementation is very clunky and broken in this example. The nested dicts are somewhat complex but writing them as you did makes it very clear what is happening.

I suppose there could be a names parameter added to agg which would take dictionary mapping the aggregating columns to their new names. You could even add another parameter drop_index as a boolean to determine whether to keep the upper index level.

So the syntax would turn into:

agg_dict = {'energy': ['sum',
                       lambda x: np.percentile(x, 98), # lambda
                       lambda x: np.percentile(x, 17), # lambda
                      ],
            'distance': ['sum',
                         'mean',
                         smrb.mad, # original function
                         mad_c1,   # partial function wrapping the original function
                        ]
           }

name_dict = {'energy':['energy_sum', 'energy_p98', 'energy_p17'],
             'distance':['distance_sum', 'distance_mean', 'distance_mad', 'distance_mad_c1']}


mydf.groupby('cat').agg(agg_dict, names=name_dict, drop_index=True)

Or maybe, an entire new method agg_assign could be created, which would work similarly to DataFrame.assign:

mydf.groupby('cat').agg_assign(energy_sum=lambda x: x.energy.sum(),
                               energy_p98=lambda x: np.percentile(x.energy, 98),
                               energy_p17=lambda x: np.percentile(x.energy, 17),
                               distance_sum=lambda x: x.distance.sum(),
                               distance_mean=lambda x: x.distance.mean(),
                               distance_mad=lambda x: smrb.mad(x.distance),
                               distance_mad_c1=lambda x: mad_c1(x.distance))

I actually like this option much better.

13reactions

smcateercommented, Nov 24, 2017

For what it’s worth, I am also strongly in favour of not depreciating the functionality.

A big reason for me is that there is something deeply queer about mixing the Python’s function name-space (something to do with the particular implementation) with the data the column names (something that should surely not know about the implementation). The fact that we are seeing columns (potentially multiple columns) named '<lambda>' is causes me severe cognitive dissonance.

The renaming approach grates, because there is this intermediary step where unnecessary (and exposed) column names are carried around. Furthermore, they are difficult to reliably, systematically rename because there are potentially dependencies on the implementation.

Aside from that, the nested dict functionality is admittedly complex, but it is a complex operation that is being performed.

TL;DR Please don’t depreciate. 😃

Top Results From Across the Web

Rename result columns from Pandas aggregation ...

Use groupby apply and return a Series to rename columns. Use the groupby apply method to perform an aggregation that.

Deprecation of using a dict in groupby().agg({...}) to rename ...

Hi all,. As a heads up, we are deprecating the use of a dictionary in the agg method on groupby/resample/rolling objects. Basic example:...

What's New — pandas 0.20.2 documentation - PyData |

This is a major release from 0.19.2 and includes a number of API changes, deprecations, new features, enhancements, and performance improvements along with...

Group and Aggregate your Data Better using Pandas Groupby

Renaming of variables using dictionaries within the agg() function as in the diagram below is being deprecated/removed from Pandas – see notes. Aggregating ......

pandas.DataFrameで、MultiIndexのカラムをフラットな形に ...

あるカテゴリでGroupbyをしてsumとmeanを取った後、カラム名を ... Deprecation of relabeling dicts in groupby.agg brings many issues #18366.