Deprecation of relabeling dicts in groupby.agg brings many issues
This issue is created based on the discussion from #15931 following the deprecation of relabeling dicts in groupby.agg. A lot of what is summarized below was already discussed in the previous discussion. I would recommend in particular https://github.com/pandas-dev/pandas/pull/15931#issuecomment-336139085 where the problems are also clearly stated.
The motivation behind the deprecation in #15931 was mostly related to bringing a consistent interface for agg() between Series and DataFrame (see also #14668 for context).
The relabeling functionality with a nested dict has been described by some as being too complex and/or inconsistent and thus deprecated.
However, this comes at a price: the impossibility to aggregate and rename at the same time leads to very annoying issues and some backward incompatibility where no sensible workaround is available:
- [annoying] no more control over the names of the resulting columns
- [annoying] you need to find a way to rename the MultiIndex after performing the aggregation, which requires keeping track of the order of columns in two places in the code… not practical at all and sometimes downright impossible (cases below).
- ⚠️ [breaking] cannot apply more than one callable with the same internal name on the same input column. This results in two sub-cases:
  - [breaking] you can no longer apply two or more lambda aggregators on the same column
  - [breaking] you can no longer apply two or more aggregators from partial functions unless you alter their hidden __name__ attribute
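The name collision behind the two breaking sub-cases can be seen without pandas at all. The sketch below is only an illustration: `spread` is a hypothetical aggregator invented for this example, and the assumption (based on the error messages in the example further down) is that the resulting column is labeled with the name of the callable, or of the function a partial wraps.

```python
from functools import partial


def spread(values, scale=1.0):
    """Hypothetical aggregator: range of the values divided by a scale."""
    return (max(values) - min(values)) / scale


# Two different lambdas...
largest = lambda x: max(x)
smallest = lambda x: min(x)

# ...carry exactly the same generic name.
lambda_names = [largest.__name__, smallest.__name__]

# A partial does not get a __name__ automatically; the only name
# available is that of the function it wraps.
spread_half = partial(spread, scale=2.0)
wrapped_name = spread_half.func.__name__

# Assigning __name__ by hand gives the partial a distinct label,
# which is the workaround described in this issue.
spread_half.__name__ = "spread_half"

print(lambda_names)          # both entries are '<lambda>'
print(wrapped_name)          # 'spread'
print(spread_half.__name__)  # 'spread_half'
```

Any naming scheme built on these attributes cannot tell the two lambdas apart, or tell `spread` from a partial of `spread`, unless the user edits the names manually.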
Example
(please note, this is a crafted example for the purpose of demonstrating the problem in as short a code as possible, but all of the demonstrated issues here did bite me in real life since the change, and in situations not as simple as here)
Input Dataframe
import pandas as pd

mydf = pd.DataFrame(
    {
        'cat': ['A', 'A', 'A', 'B', 'B', 'C'],
        'energy': [1.8, 1.95, 2.04, 1.25, 1.6, 1.01],
        'distance': [1.2, 1.5, 1.74, 0.82, 1.01, 0.6]
    },
    index=range(6)
)
  cat  distance  energy
0   A      1.20    1.80
1   A      1.50    1.95
2   A      1.74    2.04
3   B      0.82    1.25
4   B      1.01    1.60
5   C      0.60    1.01
Before:
easy to write and read, and works as expected
import numpy as np
import statsmodels.robust as smrb
from functools import partial
# median absolute deviation as a partial function
# in order to demonstrate the issue with partial functions as aggregators
mad_c1 = partial(smrb.mad, c=1)
# renaming and specifying the aggregators at the same time
# note that I want to choose the resulting column names myself
# for example "total_xxxx" instead of just "sum"
mydf_agg = mydf.groupby('cat').agg({
    'energy': {
        'total_energy': 'sum',
        'energy_p98': lambda x: np.percentile(x, 98),  # lambda
        'energy_p17': lambda x: np.percentile(x, 17),  # lambda
    },
    'distance': {
        'total_distance': 'sum',
        'average_distance': 'mean',
        'distance_mad': smrb.mad,  # original function
        'distance_mad_c1': mad_c1,  # partial function wrapping the original function
    },
})
results in
          energy                             distance
    total_energy energy_p98 energy_p17 total_distance average_distance distance_mad distance_mad_c1
cat
A           5.79     2.0364     1.8510           4.44            1.480     0.355825           0.240
B           2.85     1.5930     1.3095           1.83            0.915     0.140847           0.095
C           1.01     1.0100     1.0100           0.60            0.600     0.000000           0.000
and all that is left is:
# get rid of the first MultiIndex level in a pretty straightforward way
mydf_agg.columns = mydf_agg.columns.droplevel(level=0)
Happy dance praising pandas 💃 🕺 !
After
import numpy as np
import statsmodels.robust as smrb
from functools import partial
# median absolute deviation as a partial function
# in order to demonstrate the issue with partial functions as aggregators
mad_c1 = partial(smrb.mad, c=1)
# no way of choosing the destination's column names...
mydf_agg = mydf.groupby('cat').agg({
    'energy': [
        'sum',
        lambda x: np.percentile(x, 98),  # lambda
        lambda x: np.percentile(x, 17),  # lambda
    ],
    'distance': [
        'sum',
        'mean',
        smrb.mad,  # original function
        mad_c1,  # partial function wrapping the original function
    ],
})
The above breaks because the lambda functions will all result in columns named <lambda>, which results in:
SpecificationError: Function names must be unique, found multiple named <lambda>
Backward incompatible regression: one cannot apply two different lambdas to the same original column anymore.
If one removes the lambda x: np.percentile(x, 98) from above, we get the same issue with the partial function, which inherits the function name from the original function:
SpecificationError: Function names must be unique, found multiple named mad
Finally, after overwriting the __name__ attribute of the partial (for example with mad_c1.__name__ = 'mad_c1') we get:
    energy          distance
       sum <lambda>      sum   mean       mad mad_c1
cat
A     5.79   1.8510     4.44  1.480  0.355825  0.240
B     2.85   1.3095     1.83  0.915  0.140847  0.095
C     1.01   1.0100     0.60  0.600  0.000000  0.000
which still leaves
- one column missing (98th percentile)
- the handling of the MultiIndex columns
- and the renaming of the columns
to deal with in a separate step.
There is no control over the column names after aggregation; the best we can get in an automated way is some combination of the original column name and the aggregate function's name, like this:
mydf_agg.columns = ['_'.join(col) for col in mydf_agg.columns]
which results in:
     energy_sum  energy_<lambda>  distance_sum  distance_mean  distance_mad  distance_mad_c1
cat
A          5.79           1.8510          4.44          1.480      0.355825            0.240
B          2.85           1.3095          1.83          0.915      0.140847            0.095
C          1.01           1.0100          0.60          0.600      0.000000            0.000
and if you really need to have different names, you can do it like this:
# columns= is required: a bare mapping is applied to the index labels, not the columns
mydf_agg.rename(columns={
    "energy_sum": "total_energy",
    "energy_<lambda>": "energy_p17",
    "distance_sum": "total_distance",
    "distance_mean": "average_distance"
}, inplace=True)
but that means that you need to be careful to keep the renaming code (which must now be located at another place in the code) in sync with the code where the aggregation is defined…
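One way to keep the names next to the aggregators under the list-based API is to wrap each aggregator in a function whose __name__ is the desired column name. This is only a sketch of a workaround, not something pandas provides: `named` is a hypothetical helper written for this example, and the statsmodels aggregators are left out so that the snippet needs only pandas and NumPy.

```python
import numpy as np
import pandas as pd


def named(name, func):
    """Hypothetical helper: wrap func in a new function called name."""
    def wrapper(values):
        return func(values)
    wrapper.__name__ = name
    return wrapper


mydf = pd.DataFrame({
    "cat": ["A", "A", "A", "B", "B", "C"],
    "energy": [1.8, 1.95, 2.04, 1.25, 1.6, 1.01],
    "distance": [1.2, 1.5, 1.74, 0.82, 1.01, 0.6],
})

# Every aggregator now has a unique name, so the uniqueness check passes
# and the second column level already holds the desired labels.
mydf_named = mydf.groupby("cat").agg({
    "energy": [
        named("total_energy", lambda x: x.sum()),
        named("energy_p98", lambda x: np.percentile(x, 98)),
        named("energy_p17", lambda x: np.percentile(x, 17)),
    ],
    "distance": [
        named("total_distance", lambda x: x.sum()),
        named("average_distance", lambda x: x.mean()),
    ],
})

# Drop the level holding the original column names.
mydf_named.columns = mydf_named.columns.droplevel(level=0)
print(mydf_named)
```

The price is that string shortcuts such as 'sum' have to be replaced by callables, and the column name still travels through the function's name rather than through the data.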
Sad pandas user 😢 (who still loves pandas of course)
I am all in for consistency, and at the same time I deeply regret the deprecation of the aggregate and rename functionality. I hope the examples above make the pain points clear.
Possible solutions
- Un-deprecate the dict-of-dict relabeling functionality
- Provide another API to be able to do it (but why should there be two methods for the same main purpose, namely aggregation?)
- ??? (open to suggestions)
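For readers on a later release: pandas 0.25 and newer accept keyword arguments of the form new_name=(column, aggfunc) in groupby.agg ("named aggregation"), which is one instance of the second option in the list above. A minimal sketch on the example data follows; the statsmodels aggregators are omitted to keep it self-contained, and the assumption that several lambdas on the same column are accepted here holds for recent pandas versions but is not something stated in this issue.

```python
import numpy as np
import pandas as pd

mydf = pd.DataFrame({
    "cat": ["A", "A", "A", "B", "B", "C"],
    "energy": [1.8, 1.95, 2.04, 1.25, 1.6, 1.01],
    "distance": [1.2, 1.5, 1.74, 0.82, 1.01, 0.6],
})

# Each keyword is the output column name; the tuple gives the
# input column and the aggregator applied to it.
mydf_kw = mydf.groupby("cat").agg(
    total_energy=("energy", "sum"),
    energy_p98=("energy", lambda x: np.percentile(x, 98)),
    energy_p17=("energy", lambda x: np.percentile(x, 17)),
    total_distance=("distance", "sum"),
    average_distance=("distance", "mean"),
)
print(mydf_kw)
```

The result has flat columns with the chosen names, so no MultiIndex handling or separate rename step is needed.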
Optional read:
With respect to the aforementioned discussion in the pull request which has been going on already for a few months, I only recently realized one of the reasons why I am so bothered by this deprecation: “aggregate and rename” is a natural thing to do with GROUP BY aggregations in SQL since in SQL you usually provide the destination column name directly next to the aggregation expression, e.g. SELECT col1, avg(col2) AS col2_mean, stddev(col2) AS col2_var FROM mytable GROUP BY col1.
I’m not saying that Pandas should necessarily provide the same functionalities as SQL of course. But the examples provided above demonstrate why the dict-of-dict API was in my opinion a clean and simple solution to many use-cases.
(* I don’t personally agree that the dict-of-dict approach is complex.)
Issue Analytics
- State:
- Created 6 years ago
- Reactions:56
- Comments:37 (31 by maintainers)
Top GitHub Comments
I agree that renaming with the current agg implementation is very clunky and broken in this example. The nested dicts are somewhat complex but writing them as you did makes it very clear what is happening.
I suppose there could be a names parameter added to agg which would take a dictionary mapping the aggregating columns to their new names. You could even add another parameter drop_index as a boolean to determine whether to keep the upper index level.
So the syntax would turn into:
Or maybe, an entire new method agg_assign could be created, which would work similarly to DataFrame.assign:
I actually like this option much better.
For what it’s worth, I am also strongly in favour of not deprecating the functionality.
A big reason for me is that there is something deeply queer about mixing Python’s function namespace (something to do with the particular implementation) with the data’s column names (something that should surely not know about the implementation). The fact that we are seeing columns (potentially multiple columns) named '<lambda>' causes me severe cognitive dissonance.
The renaming approach grates, because there is this intermediary step where unnecessary (and exposed) column names are carried around. Furthermore, they are difficult to rename reliably and systematically because there are potentially dependencies on the implementation.
Aside from that, the nested dict functionality is admittedly complex, but it is a complex operation that is being performed.
TL;DR Please don’t deprecate. 😃