Inconsistent behavior when using GroupBy and pandas.Series.mode
Code Sample
import pandas
# Works great
df1 = pandas.DataFrame([[20,'A'],[20,'B'],[10,'C']])
gb1 = df1.groupby(0).agg(pandas.Series.mode)
display(gb1)
# Exception: Must produce aggregated value
df2 = pandas.DataFrame([[20,'A'],[20,'B'],[30,'C']])
gb2 = df2.groupby(0).agg(pandas.Series.mode)
display(gb2)
Problem Description
The above code works as expected for df1, returning the following result:
          1
0
10        C
20   [A, B]
(where C is a str, and [A, B] is a numpy.ndarray)
However, it doesn’t work for df2, throwing the following exception:
...
C:\ProgramData\Anaconda3\envs\py36\lib\site-packages\pandas\core\groupby\generic.py in _aggregate_named(self, func, *args, **kwargs)
907 output = func(group, *args, **kwargs)
908 if isinstance(output, (Series, Index, np.ndarray)):
--> 909 raise Exception('Must produce aggregated value')
910 result[name] = self._try_cast(output, group)
911
Exception: Must produce aggregated value
It looks like the order in which the agg function processes the groups affects pandas' error checking: if the first group returns a numpy.ndarray, the above exception is thrown, but if the first group returns a str/scalar, processing continues and the exception is not raised for subsequent groups.
In my opinion, this issue may be related to #2656 or one of its root causes ("fast apply vs. old pathway", as they put it), and to #24016 (which fails to accept numpy.ndarray as a return value). However, in our case, where pandas.Series.mode sometimes returns a scalar and sometimes a numpy.ndarray, the behavior is more elusive and more confusing, and therefore inconsistent (I spent ~2 hours debugging this, trying to understand why so many of my agg function calls work but only one doesn't).
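A possible workaround (a minimal sketch, not an official fix) is to force every group to return the same type, for example a Python list, by using apply instead of agg; the result then no longer depends on which group happens to be processed first:

import pandas

df2 = pandas.DataFrame([[20, 'A'], [20, 'B'], [30, 'C']])

# Force a consistent (list) return type for every group, so the result does
# not depend on whether the first group yields a scalar or an array.
gb2 = df2.groupby(0)[1].apply(lambda x: list(x.mode()))
print(gb2)
# 0
# 20    [A, B]
# 30       [C]
# Name: 1, dtype: object

The trade-off is that single-mode groups also come back as one-element lists rather than scalars.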
Expected Output
          1
0
20   [A, B]
30        C
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.24.1
pytest: None
pip: 19.0.1
setuptools: 40.4.3
Cython: None
numpy: 1.15.2
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.0
openpyxl: 2.5.12
xlrd: 1.2.0
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
As explained in https://github.com/pandas-dev/pandas/issues/19254, df2.groupby(0)[1].apply(lambda x: list(x.mode())) is really slow, so it would be beneficial to add a separate groupby.mode() function implemented in Cython. Such functionality is used very often for categorical data, so I am surprised that it still has not been implemented in pandas.
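As a stopgap until a Cython-level groupby.mode() exists, one vectorized alternative is to count each (group, value) pair once and keep the most frequent value(s) per group. This is only a sketch; the column names 'key' and 'val' are illustrative, and ties are returned as lists to mirror Series.mode:

import pandas

df2 = pandas.DataFrame([[20, 'A'], [20, 'B'], [30, 'C']], columns=['key', 'val'])

# Count each (key, val) pair, then keep the value(s) with the highest count
# within each key.
counts = df2.groupby(['key', 'val']).size().reset_index(name='n')
max_n = counts.groupby('key')['n'].transform('max')
modes = counts[counts['n'] == max_n].groupby('key')['val'].agg(list)
print(modes)
# key
# 20    [A, B]
# 30       [C]
# Name: val, dtype: object

Because the counting step is vectorized, this tends to scale better than a per-group Python lambda, though it has not been benchmarked here.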
This is an open issue.
Community patches are how pandas is updated.