DataFrame.GroupBy.apply has unexpected returns in some cases
Code Sample, a copy-pastable example if possible
import numpy as np
import pandas as pd
# Helper from the original report; it is not used in the example below.
def agg(a, b):
    return a + b
x = pd.DataFrame({'A': np.arange(10), 'B': [1] * 10, 'C': np.random.rand(10), 'D': np.random.rand(10)}).set_index(['A', 'B'])
x.groupby('B').apply(lambda g: g.C + g.D)
Problem description
This returns a DataFrame of shape (1, 10).
This seems to occur when (i) the groupby key has only one unique value and (ii) the applied function takes a DataFrame and returns a Series.
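To make the two conditions above easier to check on a given pandas version, here is a minimal sketch that runs the same `apply` once with a single-valued key and once with a two-valued key, and prints the type and shape of each result. The helper names `make_frame` and `add_c_d` are made up for this illustration and are not part of the original report. No expected output is listed because the result may differ between pandas versions.

```python
import numpy as np
import pandas as pd


def make_frame(b_values):
    """Build a frame like the one in the report, with a configurable 'B' key."""
    n = len(b_values)
    return pd.DataFrame({
        'A': np.arange(n),
        'B': b_values,
        'C': np.random.rand(n),
        'D': np.random.rand(n),
    }).set_index(['A', 'B'])


def add_c_d(g):
    # Takes a DataFrame (one group) and returns a Series.
    return g.C + g.D


# Case 1: the grouping key has a single unique value (as in the report).
one_group = make_frame([1] * 10).groupby('B').apply(add_c_d)

# Case 2: the grouping key has two unique values.
two_groups = make_frame([1] * 5 + [2] * 5).groupby('B').apply(add_c_d)

for label, result in [('one group', one_group), ('two groups', two_groups)]:
    print(label, type(result).__name__, result.shape)
```

If the two printed shapes differ, the installed version shows the behaviour described here.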
Expected Output
I expect it to return a Series of length 10, i.e. shape (10,).
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-72-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 0.25.1
numpy : 1.17.2
pytz : 2019.3
dateutil : 2.8.0
pip : 19.2.3
setuptools : 41.4.0
Cython : 0.29.13
pytest : 5.2.1
hypothesis : None
sphinx : 2.2.0
blosc : None
feather : None
xlsxwriter : 1.2.1
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.8.0
pandas_datareader: 0.8.1
bs4 : 4.8.0
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.0
pandas_gbq : None
pyarrow : 0.13.0
pytables : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.9
tables : 3.5.2
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.1
Issue Analytics
- Created 4 years ago
- Reactions: 3
- Comments: 6 (3 by maintainers)
It looks like there’s a check for this type of case here https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby/generic.py#L1268 that sends the code down separate paths depending on the number of unique values. When there is only one unique value, the result is explicitly unstacked here https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby/generic.py#L1306.
I’m not sure if this is by design with some other situation in mind, but I agree that the shape of the output shouldn’t depend on the cardinality of the key you’re grouping on.
My bad workaround right now is a simple loop - but I really don’t like it: it’s less pythonic and much slower when there are many groups. However, modifying the internals of pandas is not an option for me. From my point of view, this makes apply simply unusable for production data pipelines.
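For reference, a loop-based workaround of the kind mentioned above might look like the following sketch. It is not the commenter's actual code: `apply_concat` is a hypothetical helper name, and it assumes the applied function returns a Series for every group. Each group is processed separately and the pieces are joined with `pd.concat`, so the result is a Series no matter how many distinct keys there are.

```python
import numpy as np
import pandas as pd


def apply_concat(frame, by, func):
    """Hypothetical helper: call func on each group and concatenate the results.

    Assumes func returns a Series for each group.
    """
    pieces = [func(group) for _, group in frame.groupby(by)]
    return pd.concat(pieces)


x = pd.DataFrame({
    'A': np.arange(10),
    'B': [1] * 10,
    'C': np.random.rand(10),
    'D': np.random.rand(10),
}).set_index(['A', 'B'])

# A Series with the original (A, B) index, even though 'B' has one unique value.
result = apply_concat(x, 'B', lambda g: g.C + g.D)
print(type(result).__name__, result.shape)
```

As noted above, the explicit Python loop can be slow when there are many groups, so this trades speed for a predictable output shape.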