Inconsistent behavior when using GroupBy and pandas.Series.mode
Code Sample
import pandas
# Works great
df1 = pandas.DataFrame([[20,'A'],[20,'B'],[10,'C']])
gb1 = df1.groupby(0).agg(pandas.Series.mode)
display(gb1)
# Exception: Must produce aggregated value
df2 = pandas.DataFrame([[20,'A'],[20,'B'],[30,'C']])
gb2 = df2.groupby(0).agg(pandas.Series.mode)
display(gb2)
Problem Description
The above code works as expected for df1, returning the following result:
          1
0
10        C
20   [A, B]
(where C is a str, and [A, B] is a numpy.ndarray)
However, it doesn’t work for df2, throwing the following exception:
...
C:\ProgramData\Anaconda3\envs\py36\lib\site-packages\pandas\core\groupby\generic.py in _aggregate_named(self, func, *args, **kwargs)
907 output = func(group, *args, **kwargs)
908 if isinstance(output, (Series, Index, np.ndarray)):
--> 909 raise Exception('Must produce aggregated value')
910 result[name] = self._try_cast(output, group)
911
Exception: Must produce aggregated value
It looks like the order in which the agg function processes the groups affects pandas' error checking: if the first group returns a numpy.ndarray, the above exception is thrown, but if the first group returns a str/scalar, processing continues and the exception is not raised for subsequent groups.
In my opinion, this issue may be related to #2656 or one of its root causes ("fast apply vs. old pathway", as they put it), and to #24016 (which fails to accept numpy.ndarray as a return value). However, in our case, where pandas.Series.mode sometimes returns a scalar and sometimes a numpy.ndarray, the behavior is more elusive and more confusing, and therefore inconsistent (I spent ~2 hours debugging this, trying to understand why so many of my agg function calls work but only one doesn't).
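A possible workaround (a minimal sketch, not an official fix) is to force every group to return the same type, for example a Python list, by using apply instead of agg; the result then no longer depends on which group happens to be processed first:

import pandas

df2 = pandas.DataFrame([[20, 'A'], [20, 'B'], [30, 'C']])

# Force a consistent (list) return type for every group, so the result does
# not depend on whether the first group yields a scalar or an array.
gb2 = df2.groupby(0)[1].apply(lambda x: list(x.mode()))
print(gb2)
# 0
# 20    [A, B]
# 30       [C]
# Name: 1, dtype: object

The trade-off is that single-mode groups also come back as one-element lists rather than scalars.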
Expected Output
          1
0
20   [A, B]
30        C
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.24.1
pytest: None
pip: 19.0.1
setuptools: 40.4.3
Cython: None
numpy: 1.15.2
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.0
openpyxl: 2.5.12
xlrd: 1.2.0
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
As explained in https://github.com/pandas-dev/pandas/issues/19254, df2.groupby(0)[1].apply(lambda x: list(x.mode())) is really slow, so it would be beneficial to add a separate groupby.mode() function implemented in Cython. Such functionality is used very often for categorical data, so I am surprised that it still has not been implemented in pandas.
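As a stopgap until a Cython-level groupby.mode() exists, one vectorized alternative is to count each (group, value) pair once and keep the most frequent value(s) per group. This is only a sketch; the column names 'key' and 'val' are illustrative, and ties are returned as lists to mirror Series.mode:

import pandas

df2 = pandas.DataFrame([[20, 'A'], [20, 'B'], [30, 'C']], columns=['key', 'val'])

# Count each (key, val) pair, then keep the value(s) with the highest count
# within each key.
counts = df2.groupby(['key', 'val']).size().reset_index(name='n')
max_n = counts.groupby('key')['n'].transform('max')
modes = counts[counts['n'] == max_n].groupby('key')['val'].agg(list)
print(modes)
# key
# 20    [A, B]
# 30       [C]
# Name: val, dtype: object

Because the counting step is vectorized, this tends to scale better than a per-group Python lambda, though it has not been benchmarked here.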
This is an open issue.
Community patches are how pandas is updated.